Welcome to CluSim’s documentation!
Installation¶
This package is available on PyPI. To install, run the following command in a terminal:
$ pip install clusim
You can also get the source code directly from the GitHub [project page](https://github.com/Hoosier-Clusters/clusim).
Usage¶
A first comparison¶
We start by importing the required modules:
>>> from clusim.clustering import Clustering, print_clustering
>>> import clusim.sim as sim
The simplest way to make a Clustering is to use an elm2clu_dict, which maps each element to a list of the clusters to which it belongs.
>>> c1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> c2 = Clustering(elm2clu_dict = {0:[0], 1:[1], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> print_clustering(c1)
01|23|45
>>> print_clustering(c2)
0|123|45
Finally, the similarity of the two Clusterings can be found using the Jaccard Index.
>>> sim.jaccard_index(c1, c2)
0.4
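For intuition, the Jaccard index counts the element pairs that are co-clustered in both clusterings against the pairs co-clustered in either one. A minimal pure-Python sketch (independent of clusim, for partitions given as cluster lists) reproduces the value above:

```python
from itertools import combinations

def coclustered_pairs(cluster_list):
    """Set of unordered element pairs that share a cluster."""
    pairs = set()
    for cluster in cluster_list:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def pairwise_jaccard(clusters_a, clusters_b):
    """|intersection| / |union| of the two co-clustered pair sets."""
    pa = coclustered_pairs(clusters_a)
    pb = coclustered_pairs(clusters_b)
    return len(pa & pb) / len(pa | pb)

# Same clusterings as above: c1 = 01|23|45 and c2 = 0|123|45
print(pairwise_jaccard([[0, 1], [2, 3], [4, 5]], [[0], [1, 2, 3], [4, 5]]))  # 0.4
```

In practice use clusim's sim.jaccard_index; this sketch only illustrates the pair-counting view behind it.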
Basics of element-centric similarity¶
>>> from clusim.clustering import Clustering, print_clustering
>>> import clusim.sim as sim
>>> c1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> c2 = Clustering(elm2clu_dict = {0:[0], 1:[1], 2:[1], 3:[1], 4:[2], 5:[2]})
The basic element-centric similarity score with a fixed alpha:
>>> sim.element_sim(c1, c2, alpha = 0.9)
0.6944444444444443
We can also get the element scores. Note that since non-numeric elements are allowed, element_sim_elscore also returns a dict that maps each element to its index in the elementScores array.
>>> elementScores, relabeled_elements = sim.element_sim_elscore(c1, c2, alpha = 0.9)
>>> print(elementScores)
[0.5        0.33333333 0.66666667 0.66666667 1.         1.        ]
The above element-centric similarity scores can be (roughly) interpreted as follows:
1. Cluster 2 has the same membership in both clusterings, so elements 4 and 5 have an element-centric similarity of 1.0.
2. Cluster 0 differs by one element between the clusterings (element 1 moved from cluster 0 to cluster 1), so element 0 has an element-centric similarity of 1/2.
3. Cluster 1 differs by one element between the clusterings (element 1 moved from cluster 0 to cluster 1), so elements 2 and 3 have an element-centric similarity of 2/3.
4. Element 1 itself moved from cluster 0 to cluster 1, so it has an element-centric similarity of 1/3.
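This interpretation can be checked with a small pure-Python sketch of element-centric similarity for hard partitions. Under one simplified reading of the method, an element i in a cluster of size n has a personalized-PageRank affinity of (1 - alpha) + alpha/n to itself and alpha/n to each cluster co-member, and its score is 1 minus the L1 distance between its two affinity vectors rescaled by 1/(2*alpha). This is an illustrative sketch only; clusim's sim.element_sim is the real implementation and also handles overlapping and hierarchical clusterings.

```python
def affinities(partition, elements, alpha):
    """Personalized-PageRank affinity vectors for a hard partition:
    (1 - alpha) + alpha/n to itself, alpha/n to each cluster co-member."""
    cluster_of = {e: c for c in partition for e in c}
    aff = {}
    for i in elements:
        c = cluster_of[i]
        aff[i] = {j: alpha / len(c) + ((1.0 - alpha) if j == i else 0.0)
                  for j in c}
    return aff

def element_scores(p1, p2, alpha=0.9):
    """Per-element similarity: 1 minus the rescaled L1 distance of affinities."""
    elements = sorted({e for c in p1 for e in c})
    a1 = affinities(p1, elements, alpha)
    a2 = affinities(p2, elements, alpha)
    scores = {}
    for i in elements:
        support = set(a1[i]) | set(a2[i])
        l1 = sum(abs(a1[i].get(j, 0.0) - a2[i].get(j, 0.0)) for j in support)
        scores[i] = 1.0 - l1 / (2.0 * alpha)
    return scores

# c1 = 01|23|45 and c2 = 0|123|45, as in the interpretation above
scores = element_scores([[0, 1], [2, 3], [4, 5]], [[0], [1, 2, 3], [4, 5]])
print(scores[0], scores[1], scores[2])     # roughly 0.5, 1/3, 2/3
print(sum(scores.values()) / len(scores))  # roughly 0.6944444444444443
```

The per-element values and their mean match the elementScores array and the element_sim value shown above.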
Additional CluSim examples¶
Many more examples can be found in the Jupyter notebooks included with the package:
1. Using Similarity Measures
2. Adjusting the Rand Index for different random models
3. Adjusting Normalized Mutual Information
4. Basics of Element-centric Similarity
5. Valid Clusterings
6. Application with SciKit Learn
7. Application with Hierarchical Clustering
The “Clustering”¶
-
class
clusim.clustering.Clustering(elm2clu_dict=None, clu2elm_dict=None, hier_graph=None)[source]¶ Base class for clusterings.
- Parameters
elm2clu_dict (dict) – optional Initialize based on an elm2clu_dict: { elementid: [clu1, clu2, … ] }. The value is a list of clusters to which the element belongs.
clu2elm_dict (dict) – optional Initialize based on a clu2elm_dict: { clusid: [el1, el2, … ]}. Each cluster is a key with value a list of elements which belong to it.
hier_graph (networkx.Graph()) – optional Initialize based on a hierarchical acyclic graph capturing the cluster membership at each scale.
Methods
clustering_from_igraph_cover(self, igraphcover): This method creates a clustering from an igraph VertexCover object.
copy(self): Return a deep copy of the clustering.
downstream_elements(self, cluster): This method finds all elements contained in a cluster from a hierarchical clustering by visiting all downstream clusters and adding their elements.
find_clu_size_seq(self): This method finds the cluster size sequence for the clustering.
find_num_overlap(self): This method finds the number of elements which are in more than one cluster in the clustering.
from_clu2elm_dict(self, clu2elm_dict): This method creates a clustering from a clu2elm_dict dictionary: { clusid: [el1, el2, …
from_cluster_list(self, cluster_list): This method creates a clustering from a cluster list: [ [el1, el2, …], [el5, …], …
from_dict(self, clustering_dict): This method creates a Clustering object from a dictionary.
from_digraph(self, hier_graph[, …]): This method creates a hierarchical clustering with a cluster structure specified by an acyclic digraph, ‘hier_graph’.
from_elm2clu_dict(self, elm2clu_dict): This method creates a clustering from an elm2clu_dict dictionary: { elementid: [clu1, clu2, …
from_membership_list(self, membership_list): This method creates a clustering from a membership list: [ clu_for_el1, clu_for_el2, …
from_scipy_linkage(self, linkage_matrix[, …]): This method creates a clustering from a scipy linkage object resulting from agglomerative hierarchical clustering.
load(self, file): Load the Clustering from a file using json.
merge_clusters(self, c1, c2[, new_name]): This method merges the elements in two clusters from the clustering.
relabel_clusters_by_size(self): This method renames all clusters by their size.
relabel_clusters_to_match(self, …): This method renames all clusters to have maximal overlap with the ‘target_clustering’.
save(self, file): Save the Clustering to a file using json.
to_clu2elm_dict(self): Create a clu2elm_dict: {clusterid: [el1, el2, …
to_cluster_list(self): This method returns a clustering in cluster list format: [ [el1, el2, …], [el5, …], …
to_dict(self): This method turns a Clustering object into a dictionary.
to_elm2clu_dict(self): Create an elm2clu_dict: {elementid: [clu1, clu2, …
to_membership_list(self): This method returns the clustering as a membership list: [ clu_for_el1, clu_for_el2, …
validate_clustering(self): This method checks that the clustering is valid, else raises the appropriate Cluster Error.
cut_at_depth
empty_start
hier_clusdict
to_dendropy_tree
-
copy(self)[source]¶ Return a deep copy of the clustering.
- Returns
deep copy of the clustering
>>> from clusim.clustering import Clustering, print_clustering
>>> clu = Clustering()
>>> clu2 = clu.copy()
>>> print_clustering(clu)
>>> print_clustering(clu2)
-
validate_clustering(self)[source]¶ This method checks that the clustering is valid; otherwise it raises the appropriate Cluster Error.
-
from_elm2clu_dict(self, elm2clu_dict)[source]¶ This method creates a clustering from an elm2clu_dict dictionary: { elementid: [clu1, clu2, … ] } where each element is a key with value a list of clusters to which it belongs. Clustering features are then calculated.
- Parameters
elm2clu_dict (dict) – { elementid: [clu1, clu2, … ] }
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering()
>>> clu.from_elm2clu_dict(elm2clu_dict)
>>> print_clustering(clu)
-
from_clu2elm_dict(self, clu2elm_dict)[source]¶ This method creates a clustering from a clu2elm_dict dictionary: { clusid: [el1, el2, … ] } where each cluster is a key with value a list of elements which belong to it. Clustering features are then calculated.
- Parameters
clu2elm_dict (dict) – { clusid: [el1, el2, … ] }
>>> from clusim.clustering import Clustering, print_clustering
>>> clu2elm_dict = {0:[0,1,2], 1:[2,3], 2:[4,5]}
>>> clu = Clustering()
>>> clu.from_clu2elm_dict(clu2elm_dict)
>>> print_clustering(clu)
-
from_cluster_list(self, cluster_list)[source]¶ This method creates a clustering from a cluster list: [ [el1, el2, …], [el5, …], … ], a list of lists or a list of sets, where each inner list corresponds to the elements in a cluster. Clustering features are then calculated.
- Parameters
cluster_list (list) – list of lists [ [el1, el2, …], [el5, …], … ]
>>> from clusim.clustering import Clustering, print_clustering
>>> cluster_list = [ [0,1,2], [2,3], [4,5]]
>>> clu = Clustering()
>>> clu.from_cluster_list(cluster_list)
>>> print_clustering(clu)
-
to_cluster_list(self)[source]¶ This method returns a clustering in cluster list format: [ [el1, el2, …], [el5, …], … ], a list of lists, where each inner list corresponds to the elements in a cluster.
- Returns
cluster_list : list of lists, [ [el1, el2, …], [el5, …], … ]
-
from_membership_list(self, membership_list)[source]¶ This method creates a clustering from a membership list: [ clu_for_el1, clu_for_el2, … ], a list of cluster names where the ith entry corresponds to the cluster membership of the ith element. Clustering features are then calculated.
Note
Membership Lists can only represent partitions (no overlaps)
- Parameters
membership_list (list) – list of cluster names [ clu_for_el1, clu_for_el2, … ]
>>> from clusim.clustering import Clustering, print_clustering
>>> membership_list = [0,0,0,1,2,2]
>>> clu = Clustering()
>>> clu.from_membership_list(membership_list)
>>> print_clustering(clu)
-
to_membership_list(self)[source]¶ This method returns the clustering as a membership list: [ clu_for_el1, clu_for_el2, … ], a list of cluster names where the ith entry corresponds to the cluster membership of the ith element.
Note
Membership Lists can only represent partitions (no overlaps)
- Returns
list of element memberships, [ clu_for_el1, clu_for_el2, … ]
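Since a membership list is just the element-indexed view of a partition, the round-trip between the membership-list and cluster-list formats can be sketched in plain Python (the helper names here are illustrative, not clusim's API):

```python
def membership_to_cluster_list(membership_list):
    """Group element indices by their cluster label."""
    clusters = {}
    for element, label in enumerate(membership_list):
        clusters.setdefault(label, []).append(element)
    return [clusters[label] for label in sorted(clusters)]

def cluster_list_to_membership(cluster_list):
    """Invert: the ith entry is the index of the cluster containing element i."""
    membership = {}
    for label, cluster in enumerate(cluster_list):
        for element in cluster:
            membership[element] = label
    return [membership[e] for e in sorted(membership)]

print(membership_to_cluster_list([0, 0, 0, 1, 2, 2]))  # [[0, 1, 2], [3], [4, 5]]
print(cluster_list_to_membership([[0, 1, 2], [3], [4, 5]]))  # [0, 0, 0, 1, 2, 2]
```

As the Note above says, this representation only works for partitions: an element appearing in two clusters would overwrite its own entry.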
-
clustering_from_igraph_cover(self, igraphcover)[source]¶ This method creates a clustering from an igraph VertexCover object. See the igraph.Cover.VertexCover class. Clustering features are then calculated.
- Parameters
igraphcover (igraph.Cover.VertexCover) – the igraph VertexCover
-
to_clu2elm_dict(self)[source]¶ Create a clu2elm_dict: {clusterid: [el1, el2, … ]} from the stored elm2clu_dict.
- Returns
dict
-
to_elm2clu_dict(self)[source]¶ Create an elm2clu_dict: {elementid: [clu1, clu2, … ]} from the stored clu2elm_dict.
- Returns
dict
-
find_clu_size_seq(self)[source]¶ This method finds the cluster size sequence for the clustering.
- Returns
list of integers A list where the ith entry corresponds to the size of the ith cluster.
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print("Cluster Size Sequence:", clu.find_clu_size_seq())
Cluster Size Sequence: [3, 2, 2]
-
relabel_clusters_by_size(self)[source]¶ This method renames all clusters by their size.
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> clu.relabel_clusters_by_size()
>>> print("Cluster Size Sequence:", clu.find_clu_size_seq())
Cluster Size Sequence: [3, 2, 2]
-
relabel_clusters_to_match(self, target_clustering)[source]¶ This method renames all clusters to have maximal overlap with the ‘target_clustering’. This is particularly useful for drawing the clusterings.
>>> from clusim.clustering import Clustering, print_clustering
>>> clu1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[0], 3:[1], 4:[2], 5:[2]})
>>> print(clu1.to_membership_list())
>>> clu2 = Clustering(elm2clu_dict = {0:[2], 1:[2], 2:[1], 3:[1], 4:[0], 5:[0]})
>>> clu1.relabel_clusters_to_match(clu2)
>>> print(clu1.to_membership_list())
-
find_num_overlap(self)[source]¶ This method finds the number of elements which are in more than one cluster in the clustering.
- Returns
The number of elements in at least two clusters.
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print("Overlap size:", clu.find_num_overlap())
Overlap size: 1
-
merge_clusters(self, c1, c2, new_name=None)[source]¶ This method merges the elements in two clusters from the clustering. The merged cluster will be named new_name if provided, otherwise it will assume the name of cluster c1.
- Returns
self
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print_clustering(clu)
>>> clu.merge_clusters(1, 2, new_name = 3)
>>> print_clustering(clu)
-
to_dict(self)[source]¶ This method turns a Clustering object into a dictionary. Intended for use with json to save the Clustering.
- Returns
dictionary
-
from_dict(self, clustering_dict)[source]¶ This method creates a Clustering object from a dictionary. Intended for use with json to load the Clustering.
- Parameters
clustering_dict (dict) – The dictionary representation of the Clustering
-
save(self, file)[source]¶ Save the Clustering to a file using json.
- Parameters
file (str or file object) – The name of the file, or an open file object, to which the Clustering json is dumped
-
load(self, file)[source]¶ Load the Clustering from a file using json.
- Parameters
file (str or file object) – The name of the file, or an open file object, from which the Clustering json is loaded
-
downstream_elements(self, cluster)[source]¶ This method finds all elements contained in a cluster from a hierarchical clustering by visiting all downstream clusters and adding their elements.
- Parameters
cluster – the name of the parent cluster
- Returns
element list
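The traversal can be pictured as a depth-first walk over a child-adjacency dict. The structure below is hypothetical (clusim's internal hierarchy representation may differ); it only illustrates the "visit all downstream clusters and collect their elements" idea:

```python
def downstream_elements(children, leaf_elements, cluster):
    """Collect the elements of every cluster reachable below `cluster`."""
    elements, stack, seen = [], [cluster], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        elements.extend(leaf_elements.get(node, []))  # elements stored at this cluster
        stack.extend(children.get(node, []))          # descend into child clusters
    return sorted(elements)

# Hypothetical hierarchy: root cluster 0 splits into clusters 1 and 2
children = {0: [1, 2]}
leaf_elements = {1: [0, 1, 3, 4], 2: [5, 6, 7, 8]}
print(downstream_elements(children, leaf_elements, 0))  # [0, 1, 3, 4, 5, 6, 7, 8]
```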
-
from_scipy_linkage(self, linkage_matrix, dist_rescaled=False)[source]¶ This method creates a clustering from a scipy linkage object resulting from the agglomerative hierarchical clustering. Clustering features are then calculated.
- Parameters
linkage_matrix (numpy.matrix) – the linkage matrix from scipy
dist_rescaled (Boolean) – (default False) if True, the linkage distances are linearly rescaled to be between 0 and 1
>>> from clusim.clustering import Clustering, print_clustering
>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> import numpy as np
>>> np.random.seed(42)
>>> data1 = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
>>> data2 = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
>>> Xdata = np.concatenate((data1, data2), )
>>> Z = linkage(Xdata, 'ward')
>>> clu = Clustering()
>>> clu.from_scipy_linkage(Z, dist_rescaled=False)
-
from_digraph(self, hier_graph, elm2clu_dict=None, clu2elm_dict=None)[source]¶ This method creates a hierarchical clustering with a cluster structure specified by an acyclic digraph, ‘hier_graph’. The element membership into (at least) the lowest resolution of the clusters must be specified by either an ‘elm2clu_dict’ or a ‘clu2elm_dict’. The hierarchical clustering memberships are then propagated through the acyclic digraph. Finally, Clustering features are calculated.
- Parameters
hier_graph (networkx.DiGraph()) – Initialize based on a hierarchical acyclic graph capturing the cluster membership at each scale.
elm2clu_dict (dict) – optional Initialize based on an elm2clu_dict: { elementid: [clu1, clu2, … ] }. The value is a list of clusters to which the element belongs.
clu2elm_dict (dict) – optional Initialize based on an clu2elm_dict: { clusid: [el1, el2, … ]}. Each cluster is a key with value a list of elements which belong to it.
>>> from clusim.clustering import Clustering, print_clustering
>>> import networkx as nx
>>> G = nx.DiGraph()
>>> G.add_edges_from([(0,1), (0,2)])
>>> clu2elm_dict = {1:[0,1,3,4], 2:[5,6,7,8]}
>>> clu = Clustering()
>>> clu.from_digraph(hier_graph = G, clu2elm_dict = clu2elm_dict)
-
clusim.clusteringerror¶ alias of clusim.clusteringerror
-
class
clusim.clustering.SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶ Methods
default(self, obj): Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
encode(self, o): Return a JSON string representation of a Python data structure.
iterencode(self, o[, _one_shot]): Encode the given object and yield each string representation as available.
-
default(self, obj)[source]¶ Implement this method in a subclass such that it returns a serializable object for obj, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
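The same pattern applies to clusim's use case: clustering dicts contain Python sets, which the standard json module cannot serialize. A minimal sketch of such an encoder (illustrative; clusim's own SetEncoder may differ in details) converts sets to sorted lists:

```python
import json

class SetEncoder(json.JSONEncoder):
    """Encode Python sets as sorted lists so clustering dicts are JSON-serializable."""
    def default(self, o):
        if isinstance(o, set):
            return sorted(o)
        # Let the base class raise the TypeError for other unknown types
        return json.JSONEncoder.default(self, o)

clu2elm_dict = {0: {0, 1, 2}, 1: {2, 3}, 2: {4, 5}}
print(json.dumps(clu2elm_dict, cls=SetEncoder))
```

Note that json converts the integer dict keys to strings, so a loader has to convert them back when reconstructing a Clustering.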
-
-
clusim.clustering.print_clustering(clustering)[source]¶ A function to print a clustering. Clusters are separated by ‘|’. The function will only print the leaf layer of a Hierarchical Clustering.
- Parameters
clustering (Clustering) – The clustering to print
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_equal_clustering(n_elements = 9, n_clusters = 3)
>>> print_clustering(clu)
-
clusim.clustering.remap2match(clustering1, clustering2)[source]¶ Renumber membership assignments so that the first clustering has maximal overlap with the second clustering. Useful for drawing consistent pictures.
Only works with partitions.
For example:
>>> print(remap2match([3,3,1,1,0],[2,2,3,3,3]))
[2 2 3 3 4]
- Parameters
clustering1 (Clustering) – clustering to remap
clustering2 (Clustering) – clustering to match (treated as the ground truth)
- Returns
Remapped assignment of clusters from clustering1 to an equivalent label from clustering2
- Return type
dict
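One plausible greedy scheme behind this kind of relabeling: count label co-occurrences between the two membership lists, assign target labels to source labels in order of decreasing overlap, and give any leftover source label a fresh name. The sketch below is not clusim's actual implementation (and returns a plain list rather than an array), but it reproduces the example output above:

```python
from collections import Counter

def remap_to_match(memb1, memb2):
    """Greedily relabel memb1's clusters to maximally overlap memb2's."""
    overlap = Counter(zip(memb1, memb2))  # (source label, target label) counts
    taken, mapping = set(), {}
    for (lab1, lab2), _count in overlap.most_common():
        if lab1 not in mapping and lab2 not in taken:
            mapping[lab1] = lab2
            taken.add(lab2)
    # Any source label left unmatched gets a fresh, unused name
    next_label = max(set(memb2) | taken) + 1
    for lab1 in set(memb1):
        if lab1 not in mapping:
            mapping[lab1] = next_label
            next_label += 1
    return [mapping[lab] for lab in memb1]

print(remap_to_match([3, 3, 1, 1, 0], [2, 2, 3, 3, 3]))  # [2, 2, 3, 3, 4]
```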
Clustering Generation¶
-
clusim.clugen.make_equal_clustering(n_elements, n_clusters)[source]¶ This function creates a random clustering with equally sized clusters. If n_elements % n_clusters != 0, cluster sizes will differ by one element.
- Parameters
n_elements (int) – The number of elements
n_clusters (int) – The number of clusters
- Returns
The new clustering with equally sized clusters.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_equal_clustering(n_elements = 9, n_clusters = 3)
>>> print_clustering(clu)
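The "sizes differ by one element" behavior is what you get from distributing elements round-robin. A tiny illustrative sketch (not clusim's code, which also randomizes which elements land where):

```python
def equal_membership(n_elements, n_clusters):
    """Assign element i to cluster i % n_clusters (sizes differ by at most one)."""
    return [i % n_clusters for i in range(n_elements)]

memb = equal_membership(10, 3)
sizes = [memb.count(c) for c in range(3)]
print(sizes)  # [4, 3, 3]
```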
-
clusim.clugen.make_random_clustering(n_elements=1, n_clusters=1, clu_size_seq=[1, 2], random_model='all', tol=1e-15)[source]¶ This function creates a random clustering according to one of three random models. It is a wrapper around the specific functions for each random model.
- Parameters
n_elements (int) – The number of elements
n_clusters (int) – The number of clusters
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters clusters
’perm’ : the Permutation Model
tol (float) – optional The tolerance used by the algorithm for ‘all’ clusterings
- Returns
The new clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_random_clustering(n_elements = 9, n_clusters = 3, random_model = 'num')
>>> print_clustering(clu)
-
clusim.clugen.cluster_missing_elements(element_list, elm2clu_dict, new_cluster_type='singleton')[source]¶ Sometimes a clustering algorithm does not assign every element to a cluster. This function adds the missing elements to their own cluster(s).
- Parameters
element_list (list) – The complete list of elements
elm2clu_dict (dict) – { elementid: [clu1, clu2, … ] }
new_cluster_type (str) –
The new type of clusters to use:
’singleton’ : each unassigned element is put into its own singleton cluster
’giant’ : all unassigned elements are put into a single giant cluster
- Returns
The new elm2clu_dict.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import Clustering, print_clustering
>>> clu = clugen.make_random_clustering(n_elements = 7)
>>> print_clustering(clu)
>>> elm2clu_dict = clugen.cluster_missing_elements(element_list = list(range(10)), elm2clu_dict = clu.to_elm2clu_dict(), new_cluster_type = 'singleton')
>>> print_clustering(Clustering(elm2clu_dict = elm2clu_dict))
-
clusim.clugen.make_singleton_clustering(n_elements)[source]¶ This function creates a clustering with each element in its own cluster.
- Parameters
n_elements (int) – The number of elements
- Returns
The new clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_singleton_clustering(n_elements=9)
>>> print_clustering(clu)
-
clusim.clugen.make_random_dendrogram(n_elements)[source]¶ This function creates a random Hierarchical Clustering.
- Parameters
n_elements (int) – The number of elements
- Returns
The new clustering.
-
clusim.clugen.shuffle_memberships(clustering, percent=1.0)[source]¶ This function creates a new clustering by shuffling the element memberships from the original clustering.
- Parameters
clustering (Clustering) – The original clustering.
percent (float) – optional (default 1.0) The fraction of elements to shuffle, between 0.0 and 1.0.
- Returns
The new clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> orig_clu = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print_clustering(orig_clu)
>>> shuffle_clu = clugen.shuffle_memberships(orig_clu, percent=0.5)
>>> print_clustering(shuffle_clu)
-
clusim.clugen.shuffle_memberships_pa(clustering, n_steps=1, constant_num_clusters=True)[source]¶ This function creates a new clustering by shuffling the element memberships from the original clustering according to the preferential attachment model.
See [GA17] for a detailed explanation of the preferential attachment model.
- Parameters
clustering (Clustering) – The original clustering.
n_steps (int) – optional (default 1) The number of times to run the preferential attachment algorithm.
constant_num_clusters (Boolean) – optional (default True) Reject a shuffling move if it leaves a cluster with no elements. Set to True to keep the number of clusters constant.
- Returns
The new clustering with shuffled memberships.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> orig_clu = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print_clustering(orig_clu)
>>> shuffle_clu = clugen.shuffle_memberships_pa(orig_clu, n_steps=10, constant_num_clusters=True)
>>> print_clustering(shuffle_clu)
-
clusim.clugen.generate_random_partition_all(n_elements, tol=1e-15)[source]¶ This function creates a random clustering according to the ‘All’ random model by uniformly selecting a clustering from the ensemble of all clusterings with n_elements.
- Parameters
n_elements (int) – The number of elements
tol (float) – (optional) The tolerance used by the algorithm to approximate the probability distribution
- Returns
The randomly generated clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.generate_random_partition_all(n_elements=9)
>>> print_clustering(clu)
-
clusim.clugen.enumerate_random_partition_num(n_elements, n_clusters)[source]¶ A generator for every partition in ‘Num’, the ensemble of all clusterings with n_elements grouped into n_clusters non-empty clusters.
Based on a solution provided by Adeel Zafar Soomro, which was itself based on Algorithm U, described by Knuth in The Art of Computer Programming, Volume 4, Fascicle 3B.
- Parameters
n_elements (int) – The number of elements
n_clusters (int) – The number of clusters
- Returns
The new clustering as a cluster list.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> for clu in clugen.enumerate_random_partition_num(n_elements=5, n_clusters=3):
...     print_clustering(clu)
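The ensemble being enumerated can be sketched with a short recursive generator over set partitions into exactly n_clusters non-empty blocks (an illustrative stand-in, not the library's Algorithm U implementation):

```python
def partitions_into_k(elements, k):
    """Yield every partition of `elements` into exactly k non-empty blocks."""
    if not elements:
        if k == 0:
            yield []
        return
    first, rest = elements[0], elements[1:]
    # Place `first` into each existing block of a smaller partition...
    for sub in partitions_into_k(rest, k):
        for i in range(len(sub)):
            yield sub[:i] + [[first] + sub[i]] + sub[i + 1:]
    # ...or open a new block containing only `first`.
    for sub in partitions_into_k(rest, k - 1):
        yield [[first]] + sub

# The count follows the Stirling numbers of the second kind: S(5, 3) = 25.
n_partitions = sum(1 for _ in partitions_into_k(list(range(5)), 3))
```

The recursion mirrors the identity S(n, k) = k·S(n−1, k) + S(n−1, k−1).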
Clustering Similarity¶
The different clustering similarity measures available.
Pairwise Counting Measures¶
-
clusim.sim.contingency_table(clustering1, clustering2)[source]¶ This function creates the contingency table between two clusterings.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The clustering1.n_clusters by clustering2.n_clusters contingency table as a list of lists
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> cont_table = sim.contingency_table(clustering1, clustering2)
>>> print(cont_table)
-
clusim.sim.count_pairwise_cooccurence(clustering1, clustering2)[source]¶ This function finds the pairwise co-occurrence counts between two clusterings.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
N11 (int) – The number of element pairs assigned to the same clusters in both clusterings
N10 (int) – The number of element pairs assigned to the same clusters in clustering1, but different clusters in clustering2
N01 (int) – The number of element pairs assigned to different clusters in clustering1, but the same clusters in clustering2
N00 (int) – The number of element pairs assigned to different clusters in both clusterings
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> from clusim.clustering import print_clustering
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> N11, N10, N01, N00 = sim.count_pairwise_cooccurence(clustering1, clustering2)
>>> print_clustering(clustering1)
>>> print_clustering(clustering2)
>>> print(N11, "element pairs assigned to the same clusters in both clusterings")
>>> print(N10, "element pairs assigned to the same clusters in clustering1, but different clusters in clustering2")
>>> print(N01, "element pairs assigned to different clusters in clustering1, but the same clusters in clustering2")
>>> print(N00, "element pairs assigned to different clusters in both clusterings")
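The four counts can be sketched in plain Python for disjoint clusterings, here on the two six-element clusterings 01|23|45 and 0|123|45 (an illustration of the definition, not CluSim's implementation):

```python
from itertools import combinations

def pairwise_counts(labels1, labels2):
    """Count pair agreements between two flat clusterings.

    labels1 and labels2 map each element to its (single) cluster label.
    """
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(sorted(labels1), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            n11 += 1
        elif same1:
            n10 += 1
        elif same2:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # 01|23|45
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # 0|123|45
counts = pairwise_counts(c1, c2)  # (2, 1, 2, 10); the four counts sum to C(6, 2) = 15
```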
-
clusim.sim.jaccard_index(clustering1, clustering2)[source]¶ This function calculates the Jaccard index between two clusterings [Jac12].
J = N11/(N11+N10+N01)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Jaccard index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.jaccard_index(clustering1, clustering2))
-
clusim.sim.rand_index(clustering1, clustering2)[source]¶ This function calculates the Rand index between two clusterings [Ran71].
RI = (N11 + N00) / (N11 + N10 + N01 + N00)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Rand index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.rand_index(clustering1, clustering2))
-
clusim.sim.fowlkes_mallows_index(clustering1, clustering2)[source]¶ This function calculates the Fowlkes and Mallows index between two clusterings [FM83].
FM = N11 / sqrt( (N11 + N10) * (N11 + N01) )
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Fowlkes and Mallows index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.fowlkes_mallows_index(clustering1, clustering2))
-
clusim.sim.fmeasure(clustering1, clustering2)[source]¶ This function calculates the F-measure between two clusterings.
Also known as the Czekanowski index, the Dice Symmetric index, or the Sorensen index.
F = 2*N11 / (2*N11 + N10 + N01)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The F-measure (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.fmeasure(clustering1, clustering2))
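The pairwise-counting indices above differ only in how they combine the same four counts. A plain-Python sketch on the two fixed clusterings 01|23|45 and 0|123|45 (illustrative, not CluSim's code):

```python
from itertools import combinations
from math import sqrt

def pair_counts(labels1, labels2):
    """Return (N11, N10, N01, N00) over all element pairs."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(sorted(labels1), 2):
        same1, same2 = labels1[i] == labels1[j], labels2[i] == labels2[j]
        n11 += same1 and same2
        n10 += same1 and not same2
        n01 += not same1 and same2
        n00 += not same1 and not same2
    return n11, n10, n01, n00

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # 01|23|45
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # 0|123|45
n11, n10, n01, n00 = pair_counts(c1, c2)            # 2, 1, 2, 10

jaccard = n11 / (n11 + n10 + n01)                   # 2/5 = 0.4
rand = (n11 + n00) / (n11 + n10 + n01 + n00)        # 12/15 = 0.8
fowlkes_mallows = n11 / sqrt((n11 + n10) * (n11 + n01))  # 2/sqrt(12)
fmeasure = 2 * n11 / (2 * n11 + n10 + n01)          # 4/7
```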
-
clusim.sim.purity_index(clustering1, clustering2)[source]¶ This function calculates the Purity index between two clusterings.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Purity index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.purity_index(clustering1, clustering2))
-
clusim.sim.classification_error(clustering1, clustering2)[source]¶ This function calculates the Classification Error between two clusterings.
CE = 1 - PI
where PI is the Purity index.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Classification Error (between 0.0 and 1.0)
Note
CE is a distance measure; it is 0 for identical clusterings
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.classification_error(clustering1, clustering2))
-
clusim.sim.czekanowski_index(clustering1, clustering2)[source]¶ This function calculates the Czekanowski index between two clusterings.
See Fmeasure
-
clusim.sim.dice_index(clustering1, clustering2)[source]¶ This function calculates the Dice index between two clusterings.
See Fmeasure
-
clusim.sim.sorensen_index(clustering1, clustering2)[source]¶ This function calculates the Sorensen index between two clusterings.
See Fmeasure
-
clusim.sim.rogers_tanimoto_index(clustering1, clustering2)[source]¶ This function calculates the Rogers and Tanimoto index between two clusterings.
RT = (N11 + N00)/(N11 + 2*(N10+N01) + N00)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Rogers and Tanimoto index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.rogers_tanimoto_index(clustering1, clustering2))
-
clusim.sim.southwood_index(clustering1, clustering2)[source]¶ This function calculates the Southwood index between two clusterings.
SI = N11 / (N10 + N01)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Southwood index (between 0.0 and inf)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.southwood_index(clustering1, clustering2))
-
clusim.sim.pearson_correlation(clustering1, clustering2)[source]¶ This function calculates the Pearson Correlation between two clusterings.
PC = (N11*N00 - N01*N10) / sqrt( (N11+N10) * (N11+N01) * (N00+N10) * (N00+N01) )
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Pearson Correlation (between -1.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.pearson_correlation(clustering1, clustering2))
Information Theoretic Measures¶
-
clusim.sim.mi(clustering1, clustering2)[source]¶ This function calculates the Mutual Information (MI) between two clusterings [DDiazGDA05].
MI = (S(c1) + S(c2) - S(c1, c2))
where S(c1) is the Shannon Entropy of the clustering size distribution and S(c1, c2) is the Shannon Entropy of the joint clustering size distribution.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Mutual Information (between 0.0 and inf)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.mi(clustering1, clustering2))
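For disjoint clusterings, the definition above can be computed directly from the cluster size distributions. A plain-Python sketch (not CluSim's implementation) on the clusterings 01|23|45 and 0|123|45:

```python
from collections import Counter
from math import log2

def shannon(counts, n):
    """Shannon entropy (in bits) of a cluster size distribution."""
    return -sum(c / n * log2(c / n) for c in counts)

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # 01|23|45
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # 0|123|45
n = len(c1)

s1 = shannon(Counter(c1.values()).values(), n)   # cluster sizes 2, 2, 2
s2 = shannon(Counter(c2.values()).values(), n)   # cluster sizes 1, 3, 2
# Joint distribution over (cluster-in-c1, cluster-in-c2) pairs.
s12 = shannon(Counter((c1[e], c2[e]) for e in c1).values(), n)

mi = s1 + s2 - s12   # about 1.126 bits
```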
-
clusim.sim.nmi(clustering1, clustering2, norm_type='sum')[source]¶ This function calculates the Normalized Mutual Information (NMI) between two clusterings [DDiazGDA05].
NMI = (S(c1) + S(c2) - S(c1, c2)) / norm(c1, c2)
where S(c1) is the Shannon Entropy of the clustering size distribution, S(c1, c2) is the Shannon Entropy of the joint clustering size distribution, and norm(c1,c2) is a normalization term.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
norm_type (str) – ‘sum’ (default), ‘max’, ‘min’, ‘sqrt’, ‘none’ The normalization type: ‘sum’ uses the average of the two clustering entropies, ‘max’ uses the maximum of the two clustering entropies, ‘min’ uses the minimum of the two clustering entropies, ‘sqrt’ uses the geometric mean of the two clustering entropies, ‘none’ returns the Mutual Information without a normalization
- Returns
The Normalized Mutual Information index (between 0.0 and 1.0 for the normalized variants)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.nmi(clustering1, clustering2, norm_type='sum'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='max'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='min'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='sqrt'))
-
clusim.sim.vi(clustering1, clustering2, norm_type='none')[source]¶ This function calculates the Variation of Information (VI) between two clusterings [Mei03].
VI is technically a distance measure and can assume values in the range [0, inf), where 0 denotes identical clusterings.
VI = 2*S(c1, c2) - S(c1) - S(c2)
where S(c1) is the Shannon Entropy of the clustering size distribution, and S(c1, c2) is the Shannon Entropy of the joint clustering size distribution.
The VI can be transformed into a clustering similarity measure via the appropriate normalization.
VI_{sim} = 1 - 0.5*((S(c1,c2) - S(c1))/S(c2) + (S(c1,c2) - S(c2))/S(c1))
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
norm_type (str) – ‘none’ (default) or ‘entropy’ The normalization type: ‘none’ returns the standard VI as a distance metric, ‘entropy’ returns the normalized VI as a similarity measure
- Returns
The Variation of Information index (between 0.0 and inf)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.vi(clustering1, clustering2))
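Using the same entropy terms as in the MI entry above, the unnormalized VI can be sketched for two flat clusterings (illustrative plain Python, not CluSim's implementation):

```python
from collections import Counter
from math import log2

def shannon(counts, n):
    """Shannon entropy (in bits) of a cluster size distribution."""
    return -sum(c / n * log2(c / n) for c in counts)

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # 01|23|45
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # 0|123|45
n = len(c1)

s1 = shannon(Counter(c1.values()).values(), n)
s2 = shannon(Counter(c2.values()).values(), n)
s12 = shannon(Counter((c1[e], c2[e]) for e in c1).values(), n)

# VI is a distance: 0 only for identical clusterings, larger when they differ.
vi = 2 * s12 - s1 - s2
```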
-
clusim.sim.rmi(clustering1, clustering2, norm_type='none', logbase=2)[source]¶ This function calculates the Reduced Mutual Information (RMI) between two clusterings [NCY19].
RMI = MI(c1, c2) - log Omega(a, b) / n
where MI(c1, c2) is mutual information of the clusterings c1 and c2, and where Omega(a, b) is the number of contingency tables with row and column sums equal to a and b.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
norm_type (str) – ‘none’ (default) The normalization types are: ‘none’ returns the RMI without a normalization, ‘normalized’ returns the RMI with an upper bound of 1.
logbase (float) – (default 2) The base of all logarithms (2 is recommended, for bits).
- Returns
The Reduced Mutual Information index (between 0.0 and inf)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.rmi(clustering1, clustering2, norm_type='none'))
>>> print(sim.rmi(clustering1, clustering2, norm_type='normalized'))
Correction for Chance¶
-
clusim.sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='perm', norm_type='sum', n_samples=100)[source]¶ This function calculates the adjusted Similarity for one of six random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
measure (str) – The similarity measure to evaluate. Must be one of the available_similarity_measures.
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
norm_type (str) – ‘sum’ (default), ‘max’, ‘min’, ‘sqrt’, ‘none’ The normalization type used if the measure is ‘nmi’ ‘sum’ uses the average of the two clustering entropies, ‘max’ uses the maximum of the two clustering entropies, ‘min’ uses the minimum of the two clustering entropies, ‘sqrt’ uses the geometric mean of the two clustering entropies, ‘none’ returns the Mutual Information without a normalization
n_samples (int) – The number of random Clusterings sampled to determine the expected similarity.
- Returns
The adjusted Similarity measure
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='all'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='all1'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='num'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='num1'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='perm'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='perm1'))
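The correction itself is (sim - E[sim]) / (1 - E[sim]), where E[sim] is estimated by sampling from the random model. A self-contained Monte Carlo sketch under the permutation model, using a plain pairwise Jaccard on flat label dicts (illustrative only, not CluSim's implementation):

```python
import random
from itertools import combinations

def jaccard(labels1, labels2):
    """Pairwise-counting Jaccard index for two flat clusterings."""
    n11 = n10 = n01 = 0
    for i, j in combinations(sorted(labels1), 2):
        same1, same2 = labels1[i] == labels1[j], labels2[i] == labels2[j]
        n11 += same1 and same2
        n10 += same1 and not same2
        n01 += not same1 and same2
    return n11 / (n11 + n10 + n01)

def corrected_jaccard_perm(labels1, labels2, n_samples=100, seed=0):
    """(J - E[J]) / (1 - E[J]), with E[J] sampled under the permutation model."""
    rng = random.Random(seed)
    elements = sorted(labels1)
    expected = 0.0
    for _ in range(n_samples):
        # Permute which element carries which label; cluster sizes stay fixed.
        shuffled = rng.sample(elements, len(elements))
        perm1 = {e: labels1[s] for e, s in zip(elements, shuffled)}
        expected += jaccard(perm1, labels2)
    expected /= n_samples
    return (jaccard(labels1, labels2) - expected) / (1 - expected)

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
adjusted = corrected_jaccard_perm(c1, c2, n_samples=200)
```

A corrected score near 0 means the raw similarity is no better than chance under the chosen random model.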
-
clusim.sim.sample_expected_sim(clustering1, clustering2, measure='jaccard_index', random_model='perm', n_samples=1, keep_samples=False)[source]¶ This function calculates the expected Similarity for all pair-wise comparisons between Clusterings drawn from one of six random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
measure (str) – The similarity measure to evaluate. Must be one of the measures listed in sim.available_similarity_measures.
random_model (string) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
n_samples (int) – The number of random Clusterings sampled to determine the expected similarity.
keep_samples (bool) – If True, returns the Similarity samples themselves, otherwise return their mean.
- Returns
The expected Similarity measure for all pair-wise comparisons under a random model
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> c1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> c2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='all', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='all1', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='num', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='num1', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='perm', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='perm1', n_samples=50))
-
clusim.sim.expected_rand_index(n_elements, random_model='num', n_clusters1=2, n_clusters2=2, clu_size_seq1=None, clu_size_seq2=None)[source]¶ This function calculates the expectation of the Rand index between all pairs of clusterings drawn from one of six random models.
See [HA85] and [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
n_elements (int) – The number of elements
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
n_clusters1 (int) – optional The number of clusters in the first clustering
n_clusters2 (int) – optional The number of clusters in the second clustering, considered the gold-standard clustering for the one-sided expectations
clu_size_seq1 (list) – optional The cluster size sequence of the first clustering as a list of ints
clu_size_seq2 (list) – optional The cluster size sequence of the second clustering as a list of ints
- Returns
The expected Rand index (between 0.0 and 1.0)
>>> import clusim.sim as sim
>>> print(sim.expected_rand_index(n_elements=5, random_model='all'))
>>> print(sim.expected_rand_index(n_elements=5, random_model='all1', clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='num', n_clusters1=2, n_clusters2=3))
>>> print(sim.expected_rand_index(n_elements=5, random_model='num1', n_clusters1=2, clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='perm', clu_size_seq1=[2,3], clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='perm1', clu_size_seq1=[2,3], clu_size_seq2=[1,1,3]))
-
clusim.sim.adjrand_index(clustering1, clustering2, random_model='perm')[source]¶ This function calculates the adjusted Rand index for one of six random models.
See [HA85] and [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
- Returns
The adjusted Rand index
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='all'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='all1'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='num'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='num1'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='perm'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='perm1'))
-
clusim.sim.adj_mi(clustering1, clustering2, random_model='perm', norm_type='sum', logbase=2)[source]¶ This function calculates the adjusted Mutual Information for one of six random models.
See [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
random_model (string) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
norm_type (str) – ‘sum’ (default), ‘max’, ‘min’, ‘sqrt’, ‘none’ The normalization type: ‘sum’ uses the average of the two clustering entropies, ‘max’ uses the maximum of the two clustering entropies, ‘min’ uses the minimum of the two clustering entropies, ‘sqrt’ uses the geometric mean of the two clustering entropies, ‘none’ returns the Mutual Information without a normalization
logbase (float) – (default 2) The base of all logarithms (2 is recommended, for bits).
- Returns
The adjusted Mutual Information
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.adj_mi(clustering1, clustering2, random_model='all'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='all1'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='num'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='num1'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='perm'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='perm1'))
-
clusim.sim.expected_mi(n_elements, n_clusters1=2, n_clusters2=2, clu_size_seq1=None, clu_size_seq2=None, logbase=2, random_model='num')[source]¶ This function calculates the expectation of the Mutual Information between all pairs of clusterings drawn from one of six random models.
See [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
n_elements (int) – The number of elements
n_clusters1 (int) – optional The number of clusters in the first clustering
n_clusters2 (int) – optional The number of clusters in the second clustering, considered the gold-standard clustering for the one-sided expectations
clu_size_seq1 (list) – optional The cluster size sequence of the first clustering as a list of ints.
clu_size_seq2 (list) – optional The cluster size sequence of the second clustering as a list of ints.
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
logbase (float) – (default 2) The base of all logarithms (2 is recommended, for bits).
- Returns
The expected MI (between 0.0 and inf)
>>> import clusim.sim as sim
>>> print(sim.expected_mi(n_elements=5, random_model='all'))
>>> print(sim.expected_mi(n_elements=5, random_model='all1', clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='num', n_clusters1=2, n_clusters2=3))
>>> print(sim.expected_mi(n_elements=5, random_model='num1', n_clusters1=2, clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='perm', clu_size_seq1=[2,3], clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='perm1', clu_size_seq1=[2,3], clu_size_seq2=[1,1,3]))
Overlapping Clustering Similarity¶
-
clusim.sim.onmi(clustering1, clustering2)[source]¶ This function calculates the overlapping normalized mutual information.
See [LFK09] for a detailed derivation and explanation of the measure.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
the overlapping normalized mutual information
-
clusim.sim.omega_index(clustering1, clustering2)[source]¶ This function calculates the omega index between two clusterings.
See [CD88] for a detailed derivation and explanation of the measure.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
the omega index
-
clusim.sim.geometric_accuracy(clustering1, clustering2)[source]¶ This function calculates the geometric accuracy between two (overlapping) clusterings.
See [NYP12] for a detailed derivation and explanation of the measure.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
the geometric accuracy
-
clusim.sim.overlap_quality(clustering1, clustering2)[source]¶ This function calculates the overlap quality between two (overlapping) clusterings.
See [ABL10] for a detailed derivation and explanation of the measure.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
the overlap quality
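Following the description in [ABL10], the overlap quality is the mutual information between the per-element membership counts (how many clusters each element belongs to) in the two clusterings. A minimal sketch of that quantity, with hypothetical helper name and dict-of-lists input format, not CluSim's implementation:

```python
import math
from collections import Counter

def overlap_quality_sketch(elm2clu_1, elm2clu_2):
    """Mutual information (in bits) between the number of memberships of
    each element in clustering 1 and in clustering 2."""
    elements = list(elm2clu_1)
    n = len(elements)
    m1 = [len(elm2clu_1[e]) for e in elements]
    m2 = [len(elm2clu_2[e]) for e in elements]
    joint = Counter(zip(m1, m2))         # joint membership-count distribution
    p1, p2 = Counter(m1), Counter(m2)    # marginal counts
    mi = 0.0
    for (a, b), c in joint.items():
        pab = c / n
        mi += pab * math.log2(pab * n * n / (p1[a] * p2[b]))
    return mi
```

If neither clustering has overlaps, every membership count is 1 and the mutual information is 0; identical overlap structure recovers the entropy of the count distribution.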
Element-centric Clustering Similarity¶
-
clusim.clusimelement.element_sim(clustering1, clustering2, alpha=0.9, r=1.0, r2=None, rescale_path_type='max', ppr_implementation='prpack')[source]¶ The element-centric clustering similarity.
See [GWHA19] for a detailed explanation of the measure.
- Parameters
clustering1 (Clustering) – The first Clustering
clustering2 (Clustering) – The second Clustering
alpha (float) – The personalized page-rank return probability as a float in [0,1].
r (float) – The hierarchical scaling parameter for clustering1.
r2 (float) – The hierarchical scaling parameter for clustering2. Defaults to None, which forces r2 = r.
rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root; ‘min’ : the minimum path from the root; ‘linkage’ : use the linkage distances in the clustering.
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
ppr_implementation (str) – (optional) Choose an implementation for the personalized page-rank calculation. ‘prpack’: use the PPR algorithm in igraph; ‘power_iteration’: use the power iteration method.
- Returns
The element-wise similarity between the two clusterings
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]})
>>> clustering2 = Clustering(elm2clu_dict={0:[0,2], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[1,2]})
>>> print(sim.element_sim(clustering1, clustering2, alpha=0.9))
-
clusim.clusimelement.element_sim_elscore(clustering1, clustering2, alpha=0.9, r=1.0, r2=None, rescale_path_type='max', relabeled_elements=None, ppr_implementation='prpack')[source]¶ The element-centric clustering similarity for each element.
See [GWHA19] for a detailed explanation of the measure.
- Parameters
clustering1 (Clustering) – The first Clustering
clustering2 (Clustering) – The second Clustering
alpha (float) – The personalized page-rank return probability as a float in [0,1].
r (float) – The hierarchical scaling parameter for clustering1.
r2 (float) – The hierarchical scaling parameter for clustering2. Defaults to None, which forces r2 = r.
rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root; ‘min’ : the minimum path from the root; ‘linkage’ : use the linkage distances in the clustering.
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
ppr_implementation (str) – (optional) Choose an implementation for the personalized page-rank calculation. ‘prpack’: use the PPR algorithm in igraph; ‘power_iteration’: use the power iteration method.
- Returns
The element-centric similarity between the two clusterings for each element as a 1d numpy array
- Returns
a dict mapping each element to its index in the elementScores array.
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]})
>>> clustering2 = Clustering(elm2clu_dict={0:[0,2], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[1,2]})
>>> elementScores, relabeled_elements = sim.element_sim_elscore(clustering1, clustering2, alpha=0.9)
>>> print(elementScores)
-
clusim.clusimelement.cL1(x, y, alpha)[source]¶ The normalized similarity value based on the L1 probability metric, corrected for the guaranteed overlap in probability, alpha, between the two vectors.
See [GWHA19] for a detailed explanation of the need to correct the L1 metric.
- Parameters
x (2d-numpy-array) – The first list of probability vectors
y (2d-numpy-array) – The second list of probability vectors
alpha (float) – The guaranteed overlap in probability between the two vectors in [0,1].
- Returns
The 1d numpy array of L1 similarities between the affinity matrices x and y
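One way to read this correction: because of the guaranteed overlap alpha, two affinity vectors can differ in at most 2·alpha of their probability mass, so the raw row-wise L1 distance is rescaled by 2·alpha before being turned into a similarity. A minimal numpy sketch under that assumption (the helper name `cL1_similarity_sketch` is hypothetical, and this is an illustration of the correction, not CluSim's implementation):

```python
import numpy as np

def cL1_similarity_sketch(x, y, alpha=0.9):
    """Alpha-corrected L1 similarity between two stacks of probability
    vectors (one row per element). Assumes the guaranteed overlap alpha
    caps the attainable L1 distance at 2 * alpha."""
    l1 = np.abs(np.asarray(x) - np.asarray(y)).sum(axis=1)  # row-wise L1 distance
    return 1.0 - l1 / (2.0 * alpha)
```

Identical rows score 1.0, and the similarity decreases linearly in the L1 distance between the rows.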
-
clusim.clusimelement.make_affinity_matrix(clustering, alpha=0.9, r=1.0, rescale_path_type='max', relabeled_elements=None, ppr_implementation='prpack')[source]¶ The element-centric clustering similarity affinity matrix for a clustering. This function automatically determines the most efficient method to calculate the affinity matrix.
See [GWHA19] for a detailed explanation of the affinity matrix.
- Parameters
clustering (Clustering) – The clustering
alpha (float) – The personalized page-rank return probability.
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
- Returns
The element-centric affinity representation of the clustering as a 2d numpy array
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> pprmatrix = sim.make_affinity_matrix(clustering1, alpha=0.9)
>>> print(pprmatrix)
>>> clustering2 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]})
>>> pprmatrix2 = sim.make_affinity_matrix(clustering2, alpha=0.9)
>>> print(pprmatrix2)
-
clusim.clusimelement.ppr_partition(clustering, alpha=0.9, relabeled_elements=None)[source]¶ The element-centric clustering similarity affinity matrix for a partition found analytically.
- Parameters
clustering (Clustering) – The Clustering
alpha (float) – The personalized page-rank return probability as a float in [0,1].
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
- Returns
The element-centric affinity representation of the clustering as a 2d numpy array
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]}
>>> clustering1 = Clustering(elm2clu_dict=elm2clu_dict)
>>> pprmatrix = sim.ppr_partition(clustering1, alpha=0.9)
>>> print(pprmatrix)
-
clusim.clusimelement.make_cielg(clustering, r=1.0, rescale_path_type='max', relabeled_elements=None)[source]¶ Create the cluster-induced element graph for a Clustering.
- Parameters
clustering (Clustering) – The clustering
r (float) – The hierarchical scaling parameter.
rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root; ‘min’ : the minimum path from the root; ‘linkage’ : use the linkage distances in the clustering.
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
- Returns
The cluster-induced element graph for a Clustering as an igraph.WeightedGraph
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]}
>>> clustering1 = Clustering(elm2clu_dict=elm2clu_dict)
>>> cielg = sim.make_cielg(clustering1, r=1.0)
>>> print(cielg)
-
clusim.clusimelement.find_groups_in_cluster(clustervs, elementgroupList)[source]¶ A utility function to find vertices with the same cluster memberships.
- Parameters
clustervs (igraph.vertex) – an igraph vertex instance
elementgroupList (list) – a list containing the vertices to group
- Returns
a list-of-lists containing the groupings of the vertices
-
clusim.clusimelement.numerical_ppr_scores(cielg, clustering, alpha=0.9, relabeled_elements=None, ppr_implementation='prpack')[source]¶ The element-centric clustering similarity affinity matrix for a partition.
- Parameters
cielg (igraph.WeightedGraph) – An igraph weighted graph representation of the cluster-induced element graph
clustering (Clustering) – The Clustering
alpha (float) – The personalized page-rank return probability as a float in [0,1].
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
ppr_implementation (str) – (optional) Choose an implementation for the personalized page-rank calculation. ‘prpack’: use the PPR algorithm in igraph; ‘power_iteration’: use the power iteration method.
- Returns
The element-centric affinity representation of the clustering as a 2d numpy array
-
clusim.clusimelement.calculate_ppr_with_power_iteration(W_matrix, index, alpha=0.9, repetition=1000, th=0.0001)[source]¶ Implementation of personalized page-rank with power iteration. It is roughly 20 times faster than the implementation in igraph’s “prpack” on large networks.
- Parameters
W_matrix (scipy.csr_matrix) – The transition matrix of the given network
index (int) – The index of the target node
alpha (float) – The personalized page-rank return probability as a float in [0,1].
repetition (int) – (optional) The maximum number of iterations for calculating the personalized page-rank
th (float) – (optional) The calculation stops when ||p_i+1 - p_i||∞ falls below th
- Returns
The personalized page-rank scores for the target node as a 1d numpy array
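The power-iteration scheme described above can be sketched as follows, using a dense transition matrix for simplicity. The update convention here, with alpha as the probability of continuing the walk and (1 - alpha) as the probability of restarting at the target node, is an assumption about the parameterization; the helper name `ppr_power_iteration_sketch` is hypothetical and this is not CluSim's sparse implementation.

```python
import numpy as np

def ppr_power_iteration_sketch(W, index, alpha=0.9, repetition=1000, th=1e-4):
    """Personalized PageRank by power iteration.
    W is a row-stochastic transition matrix (dense numpy array);
    the walk restarts at `index` with probability (1 - alpha)."""
    n = W.shape[0]
    restart = np.zeros(n)
    restart[index] = 1.0
    p = restart.copy()
    for _ in range(repetition):
        p_next = alpha * (p @ W) + (1 - alpha) * restart
        # stop when successive iterates agree to within th in the infinity norm
        if np.abs(p_next - p).max() < th:
            return p_next
        p = p_next
    return p
```

On a two-node cycle the fixed point can be checked by hand: with alpha = 0.9 and restarts at node 0, the scores solve p0 = 0.9 p1 + 0.1 and p1 = 0.9 p0, giving p0 = 0.1/0.19.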
References¶
- ABL10
Yong-Yeol Ahn, James P Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, June 2010.
- CD88
Linda M. Collins and Clyde W. Dent. Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivariate Behavioral Research, 23(2):231–242, 1988.
- DDiazGDA05
Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008–P09008, September 2005.
- FM83
Edward B. Fowlkes and Colin L. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.
- GA17
Alexander J. Gates and Yong-Yeol Ahn. The impact of random models on clustering similarity. Journal of Machine Learning Research, 18(87):1–28, 2017.
- GWHA19
Alexander J. Gates, Ian B. Wood, William P. Hetrick, and Yong-Yeol Ahn. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific Reports, 2019.
- HA85
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
- Jac12
Paul Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.
- LFK09
Andrea Lancichinetti, Santo Fortunato, and János Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, March 2009.
- Mei03
Marina Meilă. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, pages 173–187. Springer, 2003.
- NYP12
Tamás Nepusz, Haiyuan Yu, and Alberto Paccanaro. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods, 9(5):471–472, 2012.
- NCY19
M. E. J. Newman, George T. Cantwell, and Jean-Gabriel Young. Improved mutual information measure for classification and community detection. arXiv:1907.12581, 2019.
- Ran71
William M Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846, 1971.