Welcome to CluSim’s documentation!
Installation¶
This package is available on PyPI. To install, run the following command in a terminal:
$ pip install clusim
You can also get the source code directly from the GitHub [project page](https://github.com/Hoosier-Clusters/clusim).
Usage¶
A first comparison¶
We start by importing the required modules:
>>> from clusim.clustering import Clustering, print_clustering
>>> import clusim.sim as sim
The simplest way to make a Clustering is to use an elm2clu_dict, which maps each element to a list of the clusters to which it belongs.
>>> c1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> c2 = Clustering(elm2clu_dict = {0:[0], 1:[1], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> print_clustering(c1)
01|23|45
>>> print_clustering(c2)
0|123|45
Finally, the similarity of the two Clusterings can be found using the Jaccard Index.
>>> sim.jaccard_index(c1, c2)
0.4
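For intuition, the Jaccard index counts the element pairs that are co-clustered in both clusterings against the pairs co-clustered in either one. A minimal pure-Python sketch (independent of clusim, for partitions given as cluster lists) reproduces the value above:

```python
from itertools import combinations

def coclustered_pairs(cluster_list):
    """Set of unordered element pairs that share a cluster."""
    pairs = set()
    for cluster in cluster_list:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def pairwise_jaccard(clusters_a, clusters_b):
    """|intersection| / |union| of the two co-clustered pair sets."""
    pa = coclustered_pairs(clusters_a)
    pb = coclustered_pairs(clusters_b)
    return len(pa & pb) / len(pa | pb)

# Same clusterings as above: c1 = 01|23|45 and c2 = 0|123|45
print(pairwise_jaccard([[0, 1], [2, 3], [4, 5]], [[0], [1, 2, 3], [4, 5]]))  # 0.4
```

In practice use clusim's sim.jaccard_index; this sketch only illustrates the pair-counting view behind it.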
Basics of element-centric similarity¶
>>> from clusim.clustering import Clustering, print_clustering
>>> import clusim.sim as sim
>>> c1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> c2 = Clustering(elm2clu_dict = {0:[0], 1:[1], 2:[1], 3:[1], 4:[2], 5:[2]})
The basic element-centric similarity score with a fixed alpha:
>>> sim.element_sim(c1, c2, alpha = 0.9)
0.6944444444444443
We can also get the element scores. Note that since non-numeric elements are allowed, element_sim_elscore also returns a dict that maps each element to its index in the elementScores array.
>>> elementScores, relabeled_elements = sim.element_sim_elscore(c1, c2, alpha = 0.9)
>>> print(elementScores)
[0.5        0.33333333 0.66666667 0.66666667 1.         1.        ]
The above element-centric similarity scores can be (roughly) interpreted as follows:
1. Cluster 2 has the same membership in both clusterings, so elements 4 and 5 have an element-centric similarity of 1.0.
2. Cluster 0 differs by one element between the clusterings (element 1 moved from cluster 0 to cluster 1), so element 0 has an element-centric similarity of 1/2.
3. Cluster 1 differs by one element between the clusterings (element 1 moved from cluster 0 to cluster 1), so elements 2 and 3 have an element-centric similarity of 2/3.
4. Element 1 itself moved from cluster 0 to cluster 1, so it has an element-centric similarity of 1/3.
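This interpretation can be checked with a small pure-Python sketch of element-centric similarity for hard partitions. Under one simplified reading of the method, an element i in a cluster of size n has a personalized-PageRank affinity of (1 - alpha) + alpha/n to itself and alpha/n to each cluster co-member, and its score is 1 minus the L1 distance between its two affinity vectors rescaled by 1/(2*alpha). This is an illustrative sketch only; clusim's sim.element_sim is the real implementation and also handles overlapping and hierarchical clusterings.

```python
def affinities(partition, elements, alpha):
    """Personalized-PageRank affinity vectors for a hard partition:
    (1 - alpha) + alpha/n to itself, alpha/n to each cluster co-member."""
    cluster_of = {e: c for c in partition for e in c}
    aff = {}
    for i in elements:
        c = cluster_of[i]
        aff[i] = {j: alpha / len(c) + ((1.0 - alpha) if j == i else 0.0)
                  for j in c}
    return aff

def element_scores(p1, p2, alpha=0.9):
    """Per-element similarity: 1 minus the rescaled L1 distance of affinities."""
    elements = sorted({e for c in p1 for e in c})
    a1 = affinities(p1, elements, alpha)
    a2 = affinities(p2, elements, alpha)
    scores = {}
    for i in elements:
        support = set(a1[i]) | set(a2[i])
        l1 = sum(abs(a1[i].get(j, 0.0) - a2[i].get(j, 0.0)) for j in support)
        scores[i] = 1.0 - l1 / (2.0 * alpha)
    return scores

# c1 = 01|23|45 and c2 = 0|123|45, as in the interpretation above
scores = element_scores([[0, 1], [2, 3], [4, 5]], [[0], [1, 2, 3], [4, 5]])
print(scores[0], scores[1], scores[2])     # roughly 0.5, 1/3, 2/3
print(sum(scores.values()) / len(scores))  # roughly 0.6944444444444443
```

The per-element values and their mean match the elementScores array and the element_sim value shown above.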
Additional CluSim examples¶
Many more examples can be found in the Jupyter notebooks included with the package:
1. Using Similarity Measures
2. Adjusting the Rand Index for different random models
3. Adjusting Normalized Mutual Information
4. Basics of Element-centric Similarity
5. Valid Clusterings
6. Application with SciKit Learn
7. Application with Hierarchical Clustering
The “Clustering”¶
-
class
clusim.clustering.Clustering(elm2clu_dict=None, clu2elm_dict=None, hier_graph=None)[source]¶ Base class for clusterings.
- Parameters
elm2clu_dict (dict) – optional Initialize based on an elm2clu_dict: { elementid: [clu1, clu2, … ] }. The value is a list of clusters to which the element belongs.
clu2elm_dict (dict) – optional Initialize based on a clu2elm_dict: { clusid: [el1, el2, … ]}. Each cluster is a key with value a list of elements which belong to it.
hier_graph (networkx.Graph()) – optional Initialize based on a hierarchical acyclic graph capturing the cluster membership at each scale.
Methods
clustering_from_igraph_cover(self, igraphcover): This method creates a clustering from an igraph VertexCover object.
copy(self): Return a deep copy of the clustering.
downstream_elements(self, cluster): This method finds all elements contained in a cluster from a hierarchical clustering by visiting all downstream clusters and adding their elements.
find_clu_size_seq(self): This method finds the cluster size sequence for the clustering.
find_num_overlap(self): This method finds the number of elements which are in more than one cluster in the clustering.
from_clu2elm_dict(self, clu2elm_dict): This method creates a clustering from a clu2elm_dict dictionary: { clusid: [el1, el2, …
from_cluster_list(self, cluster_list): This method creates a clustering from a cluster list: [ [el1, el2, …], [el5, …], …
from_dict(self, clustering_dict): This method creates a Clustering object from a dictionary.
from_digraph(self, hier_graph[, …]): This method creates a hierarchical clustering with a cluster structure specified by an acyclic digraph, ‘hier_graph’.
from_elm2clu_dict(self, elm2clu_dict): This method creates a clustering from an elm2clu_dict dictionary: { elementid: [clu1, clu2, …
from_membership_list(self, membership_list): This method creates a clustering from a membership list: [ clu_for_el1, clu_for_el2, …
from_scipy_linkage(self, linkage_matrix[, …]): This method creates a clustering from a scipy linkage object resulting from agglomerative hierarchical clustering.
load(self, file): Load the Clustering from a file using json.
merge_clusters(self, c1, c2[, new_name]): This method merges the elements in two clusters from the clustering.
relabel_clusters_by_size(self): This method renames all clusters by their size.
relabel_clusters_to_match(self, …): This method renames all clusters to have maximal overlap with the ‘target_clustering’.
save(self, file): Save the Clustering to a file using json.
to_clu2elm_dict(self): Create a clu2elm_dict: {clusterid: [el1, el2, …
to_cluster_list(self): This method returns a clustering in cluster list format: [ [el1, el2, …], [el5, …], …
to_dict(self): This method turns a Clustering object into a dictionary.
to_elm2clu_dict(self): Create an elm2clu_dict: {elementid: [clu1, clu2, …
to_membership_list(self): This method returns the clustering as a membership list: [ clu_for_el1, clu_for_el2, …
validate_clustering(self): This method checks that the clustering is valid, else raises the appropriate Cluster Error.
cut_at_depth
empty_start
hier_clusdict
to_dendropy_tree
-
copy(self)[source]¶ Return a deep copy of the clustering.
- Returns
deep copy of the clustering
>>> from clusim.clustering import Clustering, print_clustering
>>> clu = Clustering()
>>> clu2 = clu.copy()
>>> print_clustering(clu)
>>> print_clustering(clu2)
-
validate_clustering(self)[source]¶ This method checks that the clustering is valid; otherwise it raises the appropriate Cluster Error.
-
from_elm2clu_dict(self, elm2clu_dict)[source]¶ This method creates a clustering from an elm2clu_dict dictionary: { elementid: [clu1, clu2, … ] } where each element is a key with value a list of clusters to which it belongs. Clustering features are then calculated.
- Parameters
elm2clu_dict (dict) – { elementid: [clu1, clu2, … ] }
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering()
>>> clu.from_elm2clu_dict(elm2clu_dict)
>>> print_clustering(clu)
-
from_clu2elm_dict(self, clu2elm_dict)[source]¶ This method creates a clustering from a clu2elm_dict dictionary: { clusid: [el1, el2, … ] } where each cluster is a key with value a list of elements which belong to it. Clustering features are then calculated.
- Parameters
clu2elm_dict (dict) – { clusid: [el1, el2, … ] }
>>> from clusim.clustering import Clustering, print_clustering
>>> clu2elm_dict = {0:[0,1,2], 1:[2,3], 2:[4,5]}
>>> clu = Clustering()
>>> clu.from_clu2elm_dict(clu2elm_dict)
>>> print_clustering(clu)
-
from_cluster_list(self, cluster_list)[source]¶ This method creates a clustering from a cluster list: [ [el1, el2, …], [el5, …], … ], a list of lists or a list of sets, where each inner list corresponds to the elements in a cluster. Clustering features are then calculated.
- Parameters
cluster_list (list) – list of lists [ [el1, el2, …], [el5, …], … ]
>>> from clusim.clustering import Clustering, print_clustering
>>> cluster_list = [ [0,1,2], [2,3], [4,5]]
>>> clu = Clustering()
>>> clu.from_cluster_list(cluster_list)
>>> print_clustering(clu)
-
to_cluster_list(self)[source]¶ This method returns a clustering in cluster list format: [ [el1, el2, …], [el5, …], … ], a list of lists, where each inner list corresponds to the elements in a cluster.
- Returns
cluster_list : list of lists, [ [el1, el2, …], [el5, …], … ]
-
from_membership_list(self, membership_list)[source]¶ This method creates a clustering from a membership list: [ clu_for_el1, clu_for_el2, … ], a list of cluster names where the ith entry corresponds to the cluster membership of the ith element. Clustering features are then calculated.
Note
Membership Lists can only represent partitions (no overlaps)
- Parameters
membership_list (list) – list of cluster names [ clu_for_el1, clu_for_el2, … ]
>>> from clusim.clustering import Clustering, print_clustering
>>> membership_list = [0,0,0,1,2,2]
>>> clu = Clustering()
>>> clu.from_membership_list(membership_list)
>>> print_clustering(clu)
-
to_membership_list(self)[source]¶ This method returns the clustering as a membership list: [ clu_for_el1, clu_for_el2, … ], a list of cluster names where the ith entry corresponds to the cluster membership of the ith element.
Note
Membership Lists can only represent partitions (no overlaps)
- Returns
list of element memberships, [ clu_for_el1, clu_for_el2, … ]
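Since a membership list is just the element-indexed view of a partition, the round-trip between the membership-list and cluster-list formats can be sketched in plain Python (the helper names here are illustrative, not clusim's API):

```python
def membership_to_cluster_list(membership_list):
    """Group element indices by their cluster label."""
    clusters = {}
    for element, label in enumerate(membership_list):
        clusters.setdefault(label, []).append(element)
    return [clusters[label] for label in sorted(clusters)]

def cluster_list_to_membership(cluster_list):
    """Invert: the ith entry is the index of the cluster containing element i."""
    membership = {}
    for label, cluster in enumerate(cluster_list):
        for element in cluster:
            membership[element] = label
    return [membership[e] for e in sorted(membership)]

print(membership_to_cluster_list([0, 0, 0, 1, 2, 2]))  # [[0, 1, 2], [3], [4, 5]]
print(cluster_list_to_membership([[0, 1, 2], [3], [4, 5]]))  # [0, 0, 0, 1, 2, 2]
```

As the Note above says, this representation only works for partitions: an element appearing in two clusters would overwrite its own entry.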
-
clustering_from_igraph_cover(self, igraphcover)[source]¶ This method creates a clustering from an igraph VertexCover object. See the igraph.Cover.VertexCover class. Clustering features are then calculated.
- Parameters
igraphcover (igraph.Cover.VertexCover) – the igraph VertexCover
-
to_clu2elm_dict(self)[source]¶ Create a clu2elm_dict: {clusterid: [el1, el2, … ]} from the stored elm2clu_dict.
- Returns
dict
-
to_elm2clu_dict(self)[source]¶ Create an elm2clu_dict: {elementid: [clu1, clu2, … ]} from the stored clu2elm_dict.
- Returns
dict
-
find_clu_size_seq(self)[source]¶ This method finds the cluster size sequence for the clustering.
- Returns
list of integers A list where the ith entry corresponds to the size of the ith cluster.
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print("Cluster Size Sequence:", clu.find_clu_size_seq())
Cluster Size Sequence: [3, 2, 2]
-
relabel_clusters_by_size(self)[source]¶ This method renames all clusters by their size.
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> clu.relabel_clusters_by_size()
>>> print("Cluster Size Sequence:", clu.find_clu_size_seq())
Cluster Size Sequence: [3, 2, 2]
-
relabel_clusters_to_match(self, target_clustering)[source]¶ This method renames all clusters to have maximal overlap with the ‘target_clustering’. This is particularly useful for drawing the clusterings.
>>> from clusim.clustering import Clustering, print_clustering
>>> clu1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[0], 3:[1], 4:[2], 5:[2]})
>>> print(clu1.to_membership_list())
>>> clu2 = Clustering(elm2clu_dict = {0:[2], 1:[2], 2:[1], 3:[1], 4:[0], 5:[0]})
>>> clu1.relabel_clusters_to_match(clu2)
>>> print(clu1.to_membership_list())
-
find_num_overlap(self)[source]¶ This method finds the number of elements which are in more than one cluster in the clustering.
- Returns
The number of elements in at least two clusters.
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print("Overlap size:", clu.find_num_overlap())
Overlap size: 1
-
merge_clusters(self, c1, c2, new_name=None)[source]¶ This method merges the elements in two clusters from the clustering. The merged cluster will be named new_name if provided, otherwise it will assume the name of cluster c1.
- Returns
self
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print_clustering(clu)
>>> clu.merge_clusters(1, 2, new_name = 3)
>>> print_clustering(clu)
-
to_dict(self)[source]¶ This method turns a Clustering object into a dictionary. Intended for use with json to save the Clustering.
- Returns
dictionary
-
from_dict(self, clustering_dict)[source]¶ This method creates a Clustering object from a dictionary. Intended for use with json to load the Clustering.
- Parameters
clustering_dict (dict) – The dictionary representation of the Clustering
-
save(self, file)[source]¶ Save the Clustering to a file using json.
- Parameters
file (str or file object) – The name of the file, or an open file object, to which the Clustering json is dumped
-
load(self, file)[source]¶ Load the Clustering from a file using json.
- Parameters
file (str or file object) – The name of the file, or an open file object, from which the Clustering json is loaded
-
downstream_elements(self, cluster)[source]¶ This method finds all elements contained in a cluster from a hierarchical clustering by visiting all downstream clusters and adding their elements.
- Parameters
cluster – the name of the parent cluster
- Returns
element list
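The traversal can be pictured as a depth-first walk over a child-adjacency dict. The structure below is hypothetical (clusim's internal hierarchy representation may differ); it only illustrates the "visit all downstream clusters and collect their elements" idea:

```python
def downstream_elements(children, leaf_elements, cluster):
    """Collect the elements of every cluster reachable below `cluster`."""
    elements, stack, seen = [], [cluster], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        elements.extend(leaf_elements.get(node, []))  # elements stored at this cluster
        stack.extend(children.get(node, []))          # descend into child clusters
    return sorted(elements)

# Hypothetical hierarchy: root cluster 0 splits into clusters 1 and 2
children = {0: [1, 2]}
leaf_elements = {1: [0, 1, 3, 4], 2: [5, 6, 7, 8]}
print(downstream_elements(children, leaf_elements, 0))  # [0, 1, 3, 4, 5, 6, 7, 8]
```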
-
from_scipy_linkage(self, linkage_matrix, dist_rescaled=False)[source]¶ This method creates a clustering from a scipy linkage object resulting from the agglomerative hierarchical clustering. Clustering features are then calculated.
- Parameters
linkage_matrix (numpy.matrix) – the linkage matrix from scipy
dist_rescaled (Boolean) – (default False) if True, the linkage distances are linearly rescaled to be between 0 and 1
>>> from clusim.clustering import Clustering, print_clustering
>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> import numpy as np
>>> np.random.seed(42)
>>> data1 = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
>>> data2 = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
>>> Xdata = np.concatenate((data1, data2), )
>>> Z = linkage(Xdata, 'ward')
>>> clu = Clustering()
>>> clu.from_scipy_linkage(Z, dist_rescaled=False)
-
from_digraph(self, hier_graph, elm2clu_dict=None, clu2elm_dict=None)[source]¶ This method creates a hierarchical clustering with a cluster structure specified by an acyclic digraph, ‘hier_graph’. The element membership into (at least) the lowest resolution of the clusters must be specified by either an ‘elm2clu_dict’ or a ‘clu2elm_dict’. The hierarchical clustering memberships are then propagated through the acyclic digraph. Finally, Clustering features are calculated.
- Parameters
hier_graph (networkx.DiGraph()) – Initialize based on a hierarchical acyclic graph capturing the cluster membership at each scale.
elm2clu_dict (dict) – optional Initialize based on an elm2clu_dict: { elementid: [clu1, clu2, … ] }. The value is a list of clusters to which the element belongs.
clu2elm_dict (dict) – optional Initialize based on an clu2elm_dict: { clusid: [el1, el2, … ]}. Each cluster is a key with value a list of elements which belong to it.
>>> from clusim.clustering import Clustering, print_clustering
>>> import networkx as nx
>>> G = nx.DiGraph()
>>> G.add_edges_from([(0,1), (0,2)])
>>> clu2elm_dict = {1:[0,1,3,4], 2:[5,6,7,8]}
>>> clu = Clustering()
>>> clu.from_digraph(hier_graph = G, clu2elm_dict = clu2elm_dict)
-
clusim.clusteringerror¶ alias of clusim.clusteringerror
-
class
clusim.clustering.SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶ Methods
default(self, obj): Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
encode(self, o): Return a JSON string representation of a Python data structure.
iterencode(self, o[, _one_shot]): Encode the given object and yield each string representation as available.
-
default(self, obj)[source]¶ Implement this method in a subclass such that it returns a serializable object for obj, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
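The same pattern applies to clusim's use case: clustering dicts contain Python sets, which the standard json module cannot serialize. A minimal sketch of such an encoder (illustrative; clusim's own SetEncoder may differ in details) converts sets to sorted lists:

```python
import json

class SetEncoder(json.JSONEncoder):
    """Encode Python sets as sorted lists so clustering dicts are JSON-serializable."""
    def default(self, o):
        if isinstance(o, set):
            return sorted(o)
        # Let the base class raise the TypeError for other unknown types
        return json.JSONEncoder.default(self, o)

clu2elm_dict = {0: {0, 1, 2}, 1: {2, 3}, 2: {4, 5}}
print(json.dumps(clu2elm_dict, cls=SetEncoder))
```

Note that json converts the integer dict keys to strings, so a loader has to convert them back when reconstructing a Clustering.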
-
-
clusim.clustering.print_clustering(clustering)[source]¶ A function to print a clustering. Clusters are separated by ‘|’. The function will only print the leaf layer of a Hierarchical Clustering.
- Parameters
clustering (Clustering) – The clustering to print
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_equal_clustering(n_elements = 9, n_clusters = 3)
>>> print_clustering(clu)
-
clusim.clustering.remap2match(clustering1, clustering2)[source]¶ Renumber membership assignments so that the first clustering has maximal overlap with the second clustering. Useful for drawing consistent pictures.
Only works with partitions.
For example:
>>> print(remap2match([3,3,1,1,0],[2,2,3,3,3]))
[2 2 3 3 4]
- Parameters
clustering1 (Clustering) – clustering to remap
clustering2 (Clustering) – clustering to match (treated as the ground truth)
- Returns
Remapped assignment of clusters from clustering1 to an equivalent label from clustering2
- Return type
dict
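One plausible greedy scheme behind this kind of relabeling: count label co-occurrences between the two membership lists, assign target labels to source labels in order of decreasing overlap, and give any leftover source label a fresh name. The sketch below is not clusim's actual implementation (and returns a plain list rather than an array), but it reproduces the example output above:

```python
from collections import Counter

def remap_to_match(memb1, memb2):
    """Greedily relabel memb1's clusters to maximally overlap memb2's."""
    overlap = Counter(zip(memb1, memb2))  # (source label, target label) counts
    taken, mapping = set(), {}
    for (lab1, lab2), _count in overlap.most_common():
        if lab1 not in mapping and lab2 not in taken:
            mapping[lab1] = lab2
            taken.add(lab2)
    # Any source label left unmatched gets a fresh, unused name
    next_label = max(set(memb2) | taken) + 1
    for lab1 in set(memb1):
        if lab1 not in mapping:
            mapping[lab1] = next_label
            next_label += 1
    return [mapping[lab] for lab in memb1]

print(remap_to_match([3, 3, 1, 1, 0], [2, 2, 3, 3, 3]))  # [2, 2, 3, 3, 4]
```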
Clustering Generation¶
-
clusim.clugen.make_equal_clustering(n_elements, n_clusters)[source]¶ This function creates a random clustering with equally sized clusters. If n_elements % n_clusters != 0, cluster sizes will differ by one element.
- Parameters
n_elements (int) – The number of elements
n_clusters (int) – The number of clusters
- Returns
The new clustering with equally sized clusters.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_equal_clustering(n_elements = 9, n_clusters = 3)
>>> print_clustering(clu)
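The "sizes differ by one element" behavior is what you get from distributing elements round-robin. A tiny illustrative sketch (not clusim's code, which also randomizes which elements land where):

```python
def equal_membership(n_elements, n_clusters):
    """Assign element i to cluster i % n_clusters (sizes differ by at most one)."""
    return [i % n_clusters for i in range(n_elements)]

memb = equal_membership(10, 3)
sizes = [memb.count(c) for c in range(3)]
print(sizes)  # [4, 3, 3]
```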
-
clusim.clugen.make_random_clustering(n_elements=1, n_clusters=1, clu_size_seq=[1, 2], random_model='all', tol=1e-15)[source]¶ This function creates a random clustering according to one of three random models. It is a wrapper around the specific functions for each random model.
- Parameters
n_elements (int) – The number of elements
n_clusters (int) – The number of clusters
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters clusters
’perm’ : the Permutation Model
tol (float) – optional The tolerance used by the algorithm for ‘all’ clusterings
- Returns
The new clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_random_clustering(n_elements = 9, n_clusters = 3, random_model = 'num')
>>> print_clustering(clu)
-
clusim.clugen.cluster_missing_elements(element_list, elm2clu_dict, new_cluster_type='singleton')[source]¶ Sometimes a clustering algorithm does not assign every element to a cluster. This function adds the missing elements to their own cluster(s).
- Parameters
element_list (list) – The complete list of elements
elm2clu_dict (dict) – { elementid: [clu1, clu2, … ] }
new_cluster_type (str) –
The new type of clusters to use:
’singleton’ : each unassigned element is put into its own singleton cluster
’giant’ : all unassigned elements are put into a single giant cluster
- Returns
The new elm2clu_dict.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import Clustering, print_clustering
>>> clu = clugen.make_random_clustering(n_elements = 7)
>>> print_clustering(clu)
>>> elm2clu_dict = clugen.cluster_missing_elements(element_list = list(range(10)), elm2clu_dict = clu.to_elm2clu_dict(), new_cluster_type = 'singleton')
>>> print_clustering(Clustering(elm2clu_dict = elm2clu_dict))
-
clusim.clugen.make_singleton_clustering(n_elements)[source]¶ This function creates a clustering with each element in its own cluster.
- Parameters
n_elements (int) – The number of elements
- Returns
The new clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_singleton_clustering(n_elements=9)
>>> print_clustering(clu)
-
clusim.clugen.make_random_dendrogram(n_elements)[source]¶ This function creates a random Hierarchical Clustering.
- Parameters
n_elements (int) – The number of elements
- Returns
The new clustering.
-
clusim.clugen.shuffle_memberships(clustering, percent=1.0)[source]¶ This function creates a new clustering by shuffling the element memberships from the original clustering.
- Parameters
clustering (Clustering) – The original clustering.
percent (float) – optional (default 1.0) The fraction of elements to shuffle, between 0.0 and 1.0.
- Returns
The new clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> orig_clu = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print_clustering(orig_clu)
>>> shuffle_clu = clugen.shuffle_memberships(orig_clu, percent=0.5)
>>> print_clustering(shuffle_clu)
-
clusim.clugen.shuffle_memberships_pa(clustering, n_steps=1, constant_num_clusters=True)[source]¶ This function creates a new clustering by shuffling the element memberships from the original clustering according to the preferential attachment model.
See [GA17] for a detailed explanation of the preferential attachment model.
- Parameters
clustering (Clustering) – The original clustering.
n_steps (int) – optional (default 1) The number of times to run the preferential attachment algorithm.
constant_num_clusters (Boolean) – optional (default True) Reject a shuffling move if it leaves a cluster with no elements. Set to True to keep the number of clusters constant.
- Returns
The new clustering with shuffled memberships.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> orig_clu = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print_clustering(orig_clu)
>>> shuffle_clu = clugen.shuffle_memberships_pa(orig_clu, n_steps=10, constant_num_clusters=True)
>>> print_clustering(shuffle_clu)
-
clusim.clugen.generate_random_partition_all(n_elements, tol=1e-15)[source]¶ This function creates a random clustering according to the ‘All’ random model by uniformly selecting a clustering from the ensemble of all clusterings with n_elements.
- Parameters
n_elements (int) – The number of elements
tol (float) – (optional) The tolerance used by the algorithm to approximate the probability distribution
- Returns
The randomly generated clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.generate_random_partition_all(n_elements=9)
>>> print_clustering(clu)
-
clusim.clugen.enumerate_random_partition_num(n_elements, n_clusters)[source]¶ A generator for every partition in ‘Num’, the ensemble of all clusterings with n_elements grouped into n_clusters non-empty clusters.
Based on a solution provided by Adeel Zafar Soomro, which was itself based on Algorithm U, described by Knuth in The Art of Computer Programming, Volume 4, Fascicle 3B.
- Parameters
n_elements (int) – The number of elements
n_clusters (int) – The number of clusters
- Returns
The new clustering as a cluster list.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> for clu in clugen.enumerate_random_partition_num(n_elements=5, n_clusters=3):
...     print_clustering(clu)
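The ensemble being enumerated can be sketched with a short recursive generator over set partitions into exactly n_clusters non-empty blocks (an illustrative stand-in, not the library's Algorithm U implementation):

```python
def partitions_into_k(elements, k):
    """Yield every partition of `elements` into exactly k non-empty blocks."""
    if not elements:
        if k == 0:
            yield []
        return
    first, rest = elements[0], elements[1:]
    # Place `first` into each existing block of a smaller partition...
    for sub in partitions_into_k(rest, k):
        for i in range(len(sub)):
            yield sub[:i] + [[first] + sub[i]] + sub[i + 1:]
    # ...or open a new block containing only `first`.
    for sub in partitions_into_k(rest, k - 1):
        yield [[first]] + sub

# The count follows the Stirling numbers of the second kind: S(5, 3) = 25.
n_partitions = sum(1 for _ in partitions_into_k(list(range(5)), 3))
```

The recursion mirrors the identity S(n, k) = k·S(n−1, k) + S(n−1, k−1).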
Clustering Similarity¶
The different clustering similarity measures available.
Pairwise Counting Measures¶
-
clusim.sim.contingency_table(clustering1, clustering2)[source]¶ This function creates the contingency table between two clusterings.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The clustering1.n_clusters by clustering2.n_clusters contingency table as a list of lists
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> cont_table = sim.contingency_table(clustering1, clustering2)
>>> print(cont_table)
-
clusim.sim.count_pairwise_cooccurence(clustering1, clustering2)[source]¶ This function finds the pairwise co-occurrence counts between two clusterings.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
N11 (int) – The number of element pairs assigned to the same clusters in both clusterings
N10 (int) – The number of element pairs assigned to the same clusters in clustering1, but different clusters in clustering2
N01 (int) – The number of element pairs assigned to different clusters in clustering1, but the same clusters in clustering2
N00 (int) – The number of element pairs assigned to different clusters in both clusterings
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> from clusim.clustering import print_clustering
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> N11, N10, N01, N00 = sim.count_pairwise_cooccurence(clustering1, clustering2)
>>> print_clustering(clustering1)
>>> print_clustering(clustering2)
>>> print(N11, "element pairs assigned to the same clusters in both clusterings")
>>> print(N10, "element pairs assigned to the same clusters in clustering1, but different clusters in clustering2")
>>> print(N01, "element pairs assigned to different clusters in clustering1, but the same clusters in clustering2")
>>> print(N00, "element pairs assigned to different clusters in both clusterings")
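The four counts can be sketched in plain Python for disjoint clusterings, here on the two six-element clusterings 01|23|45 and 0|123|45 (an illustration of the definition, not CluSim's implementation):

```python
from itertools import combinations

def pairwise_counts(labels1, labels2):
    """Count pair agreements between two flat clusterings.

    labels1 and labels2 map each element to its (single) cluster label.
    """
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(sorted(labels1), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            n11 += 1
        elif same1:
            n10 += 1
        elif same2:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # 01|23|45
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # 0|123|45
counts = pairwise_counts(c1, c2)  # (2, 1, 2, 10); the four counts sum to C(6, 2) = 15
```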
-
clusim.sim.jaccard_index(clustering1, clustering2)[source]¶ This function calculates the Jaccard index between two clusterings [Jac12].
J = N11/(N11+N10+N01)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Jaccard index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.jaccard_index(clustering1, clustering2))
-
clusim.sim.rand_index(clustering1, clustering2)[source]¶ This function calculates the Rand index between two clusterings [Ran71].
RI = (N11 + N00) / (N11 + N10 + N01 + N00)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Rand index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.rand_index(clustering1, clustering2))
-
clusim.sim.fowlkes_mallows_index(clustering1, clustering2)[source]¶ This function calculates the Fowlkes and Mallows index between two clusterings [FM83].
FM = N11 / sqrt( (N11 + N10) * (N11 + N01) )
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Fowlkes and Mallows index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.fowlkes_mallows_index(clustering1, clustering2))
-
clusim.sim.fmeasure(clustering1, clustering2)[source]¶ This function calculates the F-measure between two clusterings.
Also known as the Czekanowski index, the Dice Symmetric index, or the Sorensen index.
F = 2*N11 / (2*N11 + N10 + N01)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The F-measure (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.fmeasure(clustering1, clustering2))
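The pairwise-counting indices above differ only in how they combine the same four counts. A plain-Python sketch on the two fixed clusterings 01|23|45 and 0|123|45 (illustrative, not CluSim's code):

```python
from itertools import combinations
from math import sqrt

def pair_counts(labels1, labels2):
    """Return (N11, N10, N01, N00) over all element pairs."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(sorted(labels1), 2):
        same1, same2 = labels1[i] == labels1[j], labels2[i] == labels2[j]
        n11 += same1 and same2
        n10 += same1 and not same2
        n01 += not same1 and same2
        n00 += not same1 and not same2
    return n11, n10, n01, n00

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # 01|23|45
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # 0|123|45
n11, n10, n01, n00 = pair_counts(c1, c2)            # 2, 1, 2, 10

jaccard = n11 / (n11 + n10 + n01)                   # 2/5 = 0.4
rand = (n11 + n00) / (n11 + n10 + n01 + n00)        # 12/15 = 0.8
fowlkes_mallows = n11 / sqrt((n11 + n10) * (n11 + n01))  # 2/sqrt(12)
fmeasure = 2 * n11 / (2 * n11 + n10 + n01)          # 4/7
```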
-
clusim.sim.purity_index(clustering1, clustering2)[source]¶ This function calculates the Purity index between two clusterings.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Purity index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.purity_index(clustering1, clustering2))
-
clusim.sim.classification_error(clustering1, clustering2)[source]¶ This function calculates the Classification Error between two clusterings.
CE = 1 - PI
where PI is the Purity index.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Classification Error (between 0.0 and 1.0)
Note
CE is a distance measure; it is 0 for identical clusterings
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.classification_error(clustering1, clustering2))
-
clusim.sim.czekanowski_index(clustering1, clustering2)[source]¶ This function calculates the Czekanowski index between two clusterings.
See Fmeasure
-
clusim.sim.dice_index(clustering1, clustering2)[source]¶ This function calculates the Dice index between two clusterings.
See Fmeasure
-
clusim.sim.sorensen_index(clustering1, clustering2)[source]¶ This function calculates the Sorensen index between two clusterings.
See Fmeasure
-
clusim.sim.rogers_tanimoto_index(clustering1, clustering2)[source]¶ This function calculates the Rogers and Tanimoto index between two clusterings.
RT = (N11 + N00)/(N11 + 2*(N10+N01) + N00)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Rogers and Tanimoto index (between 0.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.rogers_tanimoto_index(clustering1, clustering2))
-
clusim.sim.southwood_index(clustering1, clustering2)[source]¶ This function calculates the Southwood index between two clusterings.
SI = N11 / (N10 + N01)
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Southwood index (between 0.0 and inf)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.southwood_index(clustering1, clustering2))
-
clusim.sim.pearson_correlation(clustering1, clustering2)[source]¶ This function calculates the Pearson Correlation between two clusterings.
PC = (N11*N00 - N01*N10) / sqrt( (N11+N10) * (N11+N01) * (N00+N10) * (N00+N01) )
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Pearson Correlation (between -1.0 and 1.0)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.pearson_correlation(clustering1, clustering2))
Information Theoretic Measures¶
-
clusim.sim.mi(clustering1, clustering2)[source]¶ This function calculates the Mutual Information (MI) between two clusterings [DDiazGDA05].
MI = (S(c1) + S(c2) - S(c1, c2))
where S(c1) is the Shannon Entropy of the clustering size distribution and S(c1, c2) is the Shannon Entropy of the joint clustering size distribution.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
The Mutual Information (between 0.0 and inf)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.mi(clustering1, clustering2))
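For disjoint clusterings, the definition above can be computed directly from the cluster size distributions. A plain-Python sketch (not CluSim's implementation) on the clusterings 01|23|45 and 0|123|45:

```python
from collections import Counter
from math import log2

def shannon(counts, n):
    """Shannon entropy (in bits) of a cluster size distribution."""
    return -sum(c / n * log2(c / n) for c in counts)

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # 01|23|45
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # 0|123|45
n = len(c1)

s1 = shannon(Counter(c1.values()).values(), n)   # cluster sizes 2, 2, 2
s2 = shannon(Counter(c2.values()).values(), n)   # cluster sizes 1, 3, 2
# Joint distribution over (cluster-in-c1, cluster-in-c2) pairs.
s12 = shannon(Counter((c1[e], c2[e]) for e in c1).values(), n)

mi = s1 + s2 - s12   # about 1.126 bits
```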
-
clusim.sim.nmi(clustering1, clustering2, norm_type='sum')[source]¶ This function calculates the Normalized Mutual Information (NMI) between two clusterings [DDiazGDA05].
NMI = (S(c1) + S(c2) - S(c1, c2)) / norm(c1, c2)
where S(c1) is the Shannon Entropy of the clustering size distribution, S(c1, c2) is the Shannon Entropy of the joint clustering size distribution, and norm(c1,c2) is a normalization term.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
norm_type (str) – ‘sum’ (default), ‘max’, ‘min’, ‘sqrt’, ‘none’ The normalization type: ‘sum’ uses the average of the two clustering entropies, ‘max’ uses the maximum of the two clustering entropies, ‘min’ uses the minimum of the two clustering entropies, ‘sqrt’ uses the geometric mean of the two clustering entropies, ‘none’ returns the Mutual Information without a normalization
- Returns
The Normalized Mutual Information index (between 0.0 and 1.0 for the normalized variants)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.nmi(clustering1, clustering2, norm_type='sum'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='max'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='min'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='sqrt'))
-
clusim.sim.vi(clustering1, clustering2, norm_type='none')[source]¶ This function calculates the Variation of Information (VI) between two clusterings [Mei03].
VI is technically a distance measure and can assume values in the range [0, inf), where 0 denotes identical clusterings.
VI = 2*S(c1, c2) - S(c1) - S(c2)
where S(c1) is the Shannon Entropy of the clustering size distribution, and S(c1, c2) is the Shannon Entropy of the joint clustering size distribution.
The VI can be transformed into a clustering similarity measure via the appropriate normalization.
VI_{sim} = 1 - 0.5*((S(c1,c2) - S(c1))/S(c2) + (S(c1,c2) - S(c2))/S(c1))
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
norm_type (str) – ‘none’ (default) or ‘entropy’ The normalization type: ‘none’ returns the standard VI as a distance metric, ‘entropy’ returns the normalized VI as a similarity measure
- Returns
The Variation of Information index (between 0.0 and inf)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.vi(clustering1, clustering2))
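Using the same entropy terms as in the MI entry above, the unnormalized VI can be sketched for two flat clusterings (illustrative plain Python, not CluSim's implementation):

```python
from collections import Counter
from math import log2

def shannon(counts, n):
    """Shannon entropy (in bits) of a cluster size distribution."""
    return -sum(c / n * log2(c / n) for c in counts)

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # 01|23|45
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # 0|123|45
n = len(c1)

s1 = shannon(Counter(c1.values()).values(), n)
s2 = shannon(Counter(c2.values()).values(), n)
s12 = shannon(Counter((c1[e], c2[e]) for e in c1).values(), n)

# VI is a distance: 0 only for identical clusterings, larger when they differ.
vi = 2 * s12 - s1 - s2
```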
-
clusim.sim.rmi(clustering1, clustering2, norm_type='none', logbase=2)[source]¶ This function calculates the Reduced Mutual Information (RMI) between two clusterings [NCY19].
RMI = MI(c1, c2) - log Omega(a, b) / n
where MI(c1, c2) is mutual information of the clusterings c1 and c2, and where Omega(a, b) is the number of contingency tables with row and column sums equal to a and b.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
norm_type (str) – ‘none’ (default) The normalization types are: ‘none’ returns the RMI without a normalization, ‘normalized’ returns the RMI with an upper bound of 1.
logbase (float) – (default 2) The base of all logarithms (2 is recommended, for bits).
- Returns
The Reduced Mutual Information index (between 0.0 and inf)
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='num')
>>> print(sim.rmi(clustering1, clustering2, norm_type='none'))
>>> print(sim.rmi(clustering1, clustering2, norm_type='normalized'))
Correction for Chance¶
-
clusim.sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='perm', norm_type='sum', n_samples=100)[source]¶ This function calculates the adjusted Similarity for one of six random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
measure (str) – The similarity measure to evaluate. Must be one of the available_similarity_measures.
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
norm_type (str) – ‘sum’ (default), ‘max’, ‘min’, ‘sqrt’, ‘none’ The normalization type used if the measure is ‘nmi’ ‘sum’ uses the average of the two clustering entropies, ‘max’ uses the maximum of the two clustering entropies, ‘min’ uses the minimum of the two clustering entropies, ‘sqrt’ uses the geometric mean of the two clustering entropies, ‘none’ returns the Mutual Information without a normalization
n_samples (int) – The number of random Clusterings sampled to determine the expected similarity.
- Returns
The adjusted Similarity measure
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='all'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='all1'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='num'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='num1'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='perm'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='perm1'))
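The correction itself is (sim - E[sim]) / (1 - E[sim]), where E[sim] is estimated by sampling from the random model. A self-contained Monte Carlo sketch under the permutation model, using a plain pairwise Jaccard on flat label dicts (illustrative only, not CluSim's implementation):

```python
import random
from itertools import combinations

def jaccard(labels1, labels2):
    """Pairwise-counting Jaccard index for two flat clusterings."""
    n11 = n10 = n01 = 0
    for i, j in combinations(sorted(labels1), 2):
        same1, same2 = labels1[i] == labels1[j], labels2[i] == labels2[j]
        n11 += same1 and same2
        n10 += same1 and not same2
        n01 += not same1 and same2
    return n11 / (n11 + n10 + n01)

def corrected_jaccard_perm(labels1, labels2, n_samples=100, seed=0):
    """(J - E[J]) / (1 - E[J]), with E[J] sampled under the permutation model."""
    rng = random.Random(seed)
    elements = sorted(labels1)
    expected = 0.0
    for _ in range(n_samples):
        # Permute which element carries which label; cluster sizes stay fixed.
        shuffled = rng.sample(elements, len(elements))
        perm1 = {e: labels1[s] for e, s in zip(elements, shuffled)}
        expected += jaccard(perm1, labels2)
    expected /= n_samples
    return (jaccard(labels1, labels2) - expected) / (1 - expected)

c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
adjusted = corrected_jaccard_perm(c1, c2, n_samples=200)
```

A corrected score near 0 means the raw similarity is no better than chance under the chosen random model.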
-
clusim.sim.sample_expected_sim(clustering1, clustering2, measure='jaccard_index', random_model='perm', n_samples=1, keep_samples=False)[source]¶ This function calculates the expected Similarity for all pair-wise comparisons between Clusterings drawn from one of six random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
measure (str) – The similarity measure to evaluate. Must be one of the measures listed in sim.available_similarity_measures.
random_model (string) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
n_samples (int) – The number of random Clusterings sampled to determine the expected similarity.
keep_samples (bool) – If True, returns the Similarity samples themselves, otherwise return their mean.
- Returns
The expected Similarity measure for all pair-wise comparisons under a random model
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> c1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> c2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='all', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='all1', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='num', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='num1', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='perm', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='perm1', n_samples=50))
-
clusim.sim.expected_rand_index(n_elements, random_model='num', n_clusters1=2, n_clusters2=2, clu_size_seq1=None, clu_size_seq2=None)[source]¶ This function calculates the expectation of the Rand index between all pairs of clusterings drawn from one of six random models.
See [HA85] and [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
n_elements (int) – The number of elements
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
n_clusters1 (int) – optional The number of clusters in the first clustering
n_clusters2 (int) – optional The number of clusters in the second clustering, considered the gold-standard clustering for the one-sided expectations
clu_size_seq1 (list) – optional The cluster size sequence of the first clustering as a list of ints
clu_size_seq2 (list) – optional The cluster size sequence of the second clustering as a list of ints
- Returns
The expected Rand index (between 0.0 and 1.0)
>>> import clusim.sim as sim
>>> print(sim.expected_rand_index(n_elements=5, random_model='all'))
>>> print(sim.expected_rand_index(n_elements=5, random_model='all1', clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='num', n_clusters1=2, n_clusters2=3))
>>> print(sim.expected_rand_index(n_elements=5, random_model='num1', n_clusters1=2, clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='perm', clu_size_seq1=[2,3], clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='perm1', clu_size_seq1=[2,3], clu_size_seq2=[1,1,3]))
-
clusim.sim.adjrand_index(clustering1, clustering2, random_model='perm')[source]¶ This function calculates the adjusted Rand index for one of six random models.
See [HA85] and [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
- Returns
The adjusted Rand index
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='all'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='all1'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='num'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='num1'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='perm'))
>>> print(sim.adjrand_index(clustering1, clustering2, random_model='perm1'))
-
clusim.sim.adj_mi(clustering1, clustering2, random_model='perm', norm_type='sum', logbase=2)[source]¶ This function calculates the adjusted Mutual Information for one of six random models.
See [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
random_model (string) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
norm_type (str) – ‘sum’ (default), ‘max’, ‘min’, ‘sqrt’, ‘none’ The normalization type: ‘sum’ uses the average of the two clustering entropies, ‘max’ uses the maximum of the two clustering entropies, ‘min’ uses the minimum of the two clustering entropies, ‘sqrt’ uses the geometric mean of the two clustering entropies, ‘none’ returns the Mutual Information without a normalization
logbase (float) – (default 2) The base of all logarithms (2 is recommended, for bits).
- Returns
The adjusted Mutual Information
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.adj_mi(clustering1, clustering2, random_model='all'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='all1'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='num'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='num1'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='perm'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='perm1'))
-
clusim.sim.expected_mi(n_elements, n_clusters1=2, n_clusters2=2, clu_size_seq1=None, clu_size_seq2=None, logbase=2, random_model='num')[source]¶ This function calculates the expectation of the Mutual Information between all pairs of clusterings drawn from one of six random models.
See [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
- Parameters
n_elements (int) – The number of elements
n_clusters1 (int) – optional The number of clusters in the first clustering
n_clusters2 (int) – optional The number of clusters in the second clustering, considered the gold-standard clustering for the one-sided expectations
clu_size_seq1 (list) – optional The cluster size sequence of the first clustering as a list of ints.
clu_size_seq2 (list) – optional The cluster size sequence of the second clustering as a list of ints.
random_model (str) –
The random model to use:
’all’ : uniform distribution over the set of all clusterings of n_elements
’all1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements
’num’ : uniform distribution over the set of all clusterings of n_elements in n_clusters
’num1’ : one-sided selection from the uniform distribution over the set of all clusterings of n_elements in n_clusters
’perm’ : the permutation model for a fixed cluster size sequence
’perm1’ : one-sided selection from the permutation model for a fixed cluster size sequence, same as ’perm’
logbase (float) – (default 2) The base of all logarithms (2 is recommended, for bits).
- Returns
The expected MI (between 0.0 and inf)
>>> import clusim.sim as sim
>>> print(sim.expected_mi(n_elements=5, random_model='all'))
>>> print(sim.expected_mi(n_elements=5, random_model='all1', clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='num', n_clusters1=2, n_clusters2=3))
>>> print(sim.expected_mi(n_elements=5, random_model='num1', n_clusters1=2, clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='perm', clu_size_seq1=[2,3], clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='perm1', clu_size_seq1=[2,3], clu_size_seq2=[1,1,3]))
Overlapping Clustering Similarity¶
-
clusim.sim.onmi(clustering1, clustering2)[source]¶ This function calculates the overlapping normalized mutual information.
See [LFK09] for a detailed derivation and explanation of the measure.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
the overlapping normalized mutual information
-
clusim.sim.omega_index(clustering1, clustering2)[source]¶ This function calculates the omega index between two clusterings.
See [CD88] for a detailed derivation and explanation of the measure.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
the omega index
-
clusim.sim.geometric_accuracy(clustering1, clustering2)[source]¶ This function calculates the geometric accuracy between two (overlapping) clusterings.
See [NYP12] for a detailed derivation and explanation of the measure.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
the geometric accuracy
-
clusim.sim.overlap_quality(clustering1, clustering2)[source]¶ This function calculates the overlap quality between two (overlapping) clusterings.
See [ABL10] for a detailed derivation and explanation of the measure.
- Parameters
clustering1 (Clustering) – The first clustering.
clustering2 (Clustering) – The second clustering.
- Returns
the overlap quality
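Following the description in [ABL10], the overlap quality is the mutual information between the per-element membership counts (how many clusters each element belongs to) in the two clusterings. A minimal sketch of that quantity, with hypothetical helper name and dict-of-lists input format, not CluSim's implementation:

```python
import math
from collections import Counter

def overlap_quality_sketch(elm2clu_1, elm2clu_2):
    """Mutual information (in bits) between the number of memberships of
    each element in clustering 1 and in clustering 2."""
    elements = list(elm2clu_1)
    n = len(elements)
    m1 = [len(elm2clu_1[e]) for e in elements]
    m2 = [len(elm2clu_2[e]) for e in elements]
    joint = Counter(zip(m1, m2))         # joint membership-count distribution
    p1, p2 = Counter(m1), Counter(m2)    # marginal counts
    mi = 0.0
    for (a, b), c in joint.items():
        pab = c / n
        mi += pab * math.log2(pab * n * n / (p1[a] * p2[b]))
    return mi
```

If neither clustering has overlaps, every membership count is 1 and the mutual information is 0; identical overlap structure recovers the entropy of the count distribution.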
Element-centric Clustering Similarity¶
-
clusim.clusimelement.element_sim(clustering1, clustering2, alpha=0.9, r=1.0, r2=None, rescale_path_type='max', ppr_implementation='prpack')[source]¶ The element-centric clustering similarity.
See [GWHA19] for a detailed explanation of the measure.
- Parameters
clustering1 (Clustering) – The first Clustering
clustering2 (Clustering) – The second Clustering
alpha (float) – The personalized page-rank return probability as a float in [0,1].
r (float) – The hierarchical scaling parameter for clustering1.
r2 (float) – The hierarchical scaling parameter for clustering2. Defaults to None, which forces r2 = r.
rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root; ‘min’ : the minimum path from the root; ‘linkage’ : use the linkage distances in the clustering.
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
ppr_implementation (str) – (optional) Choose an implementation for the personalized page-rank calculation. ‘prpack’: use the PPR algorithm in igraph; ‘power_iteration’: use the power iteration method.
- Returns
The element-wise similarity between the two clusterings
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]})
>>> clustering2 = Clustering(elm2clu_dict={0:[0,2], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[1,2]})
>>> print(sim.element_sim(clustering1, clustering2, alpha=0.9))
-
clusim.clusimelement.element_sim_elscore(clustering1, clustering2, alpha=0.9, r=1.0, r2=None, rescale_path_type='max', relabeled_elements=None, ppr_implementation='prpack')[source]¶ The element-centric clustering similarity for each element.
See [GWHA19] for a detailed explanation of the measure.
- Parameters
clustering1 (Clustering) – The first Clustering
clustering2 (Clustering) – The second Clustering
alpha (float) – The personalized page-rank return probability as a float in [0,1].
r (float) – The hierarchical scaling parameter for clustering1.
r2 (float) – The hierarchical scaling parameter for clustering2. Defaults to None, which forces r2 = r.
rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root; ‘min’ : the minimum path from the root; ‘linkage’ : use the linkage distances in the clustering.
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
ppr_implementation (str) – (optional) Choose an implementation for the personalized page-rank calculation. ‘prpack’: use the PPR algorithm in igraph; ‘power_iteration’: use the power iteration method.
- Returns
The element-centric similarity between the two clusterings for each element as a 1d numpy array
- Returns
a dict mapping each element to its index in the elementScores array.
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]})
>>> clustering2 = Clustering(elm2clu_dict={0:[0,2], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[1,2]})
>>> elementScores, relabeled_elements = sim.element_sim_elscore(clustering1, clustering2, alpha=0.9)
>>> print(elementScores)
-
clusim.clusimelement.cL1(x, y, alpha)[source]¶ The normalized similarity value based on the L1 probability metric, corrected for the guaranteed overlap in probability, alpha, between the two vectors.
See [GWHA19] for a detailed explanation of the need to correct the L1 metric.
- Parameters
x (2d-numpy-array) – The first list of probability vectors
y (2d-numpy-array) – The second list of probability vectors
alpha (float) – The guaranteed overlap in probability between the two vectors in [0,1].
- Returns
The 1d numpy array of L1 similarities between the affinity matrices x and y
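One way to read this correction: because of the guaranteed overlap alpha, two affinity vectors can differ in at most 2·alpha of their probability mass, so the raw row-wise L1 distance is rescaled by 2·alpha before being turned into a similarity. A minimal numpy sketch under that assumption (the helper name `cL1_similarity_sketch` is hypothetical, and this is an illustration of the correction, not CluSim's implementation):

```python
import numpy as np

def cL1_similarity_sketch(x, y, alpha=0.9):
    """Alpha-corrected L1 similarity between two stacks of probability
    vectors (one row per element). Assumes the guaranteed overlap alpha
    caps the attainable L1 distance at 2 * alpha."""
    l1 = np.abs(np.asarray(x) - np.asarray(y)).sum(axis=1)  # row-wise L1 distance
    return 1.0 - l1 / (2.0 * alpha)
```

Identical rows score 1.0, and the similarity decreases linearly in the L1 distance between the rows.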
-
clusim.clusimelement.make_affinity_matrix(clustering, alpha=0.9, r=1.0, rescale_path_type='max', relabeled_elements=None, ppr_implementation='prpack')[source]¶ The element-centric clustering similarity affinity matrix for a clustering. This function automatically determines the most efficient method to calculate the affinity matrix.
See [GWHA19] for a detailed explanation of the affinity matrix.
- Parameters
clustering (Clustering) – The clustering
alpha (float) – The personalized page-rank return probability.
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
- Returns
The element-centric affinity representation of the clustering as a 2d numpy array
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> pprmatrix = sim.make_affinity_matrix(clustering1, alpha=0.9)
>>> print(pprmatrix)
>>> clustering2 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]})
>>> pprmatrix2 = sim.make_affinity_matrix(clustering2, alpha=0.9)
>>> print(pprmatrix2)
-
clusim.clusimelement.ppr_partition(clustering, alpha=0.9, relabeled_elements=None)[source]¶ The element-centric clustering similarity affinity matrix for a partition found analytically.
- Parameters
clustering (Clustering) – The Clustering
alpha (float) – The personalized page-rank return probability as a float in [0,1].
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
- Returns
The element-centric affinity representation of the clustering as a 2d numpy array
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]}
>>> clustering1 = Clustering(elm2clu_dict=elm2clu_dict)
>>> pprmatrix = sim.ppr_partition(clustering1, alpha=0.9)
>>> print(pprmatrix)
-
clusim.clusimelement.make_cielg(clustering, r=1.0, rescale_path_type='max', relabeled_elements=None)[source]¶ Create the cluster-induced element graph for a Clustering.
- Parameters
clustering (Clustering) – The clustering
r (float) – The hierarchical scaling parameter.
rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root; ‘min’ : the minimum path from the root; ‘linkage’ : use the linkage distances in the clustering.
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
- Returns
The cluster-induced element graph for a Clustering as an igraph.WeightedGraph
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]}
>>> clustering1 = Clustering(elm2clu_dict=elm2clu_dict)
>>> cielg = sim.make_cielg(clustering1, r=1.0)
>>> print(cielg)
-
clusim.clusimelement.find_groups_in_cluster(clustervs, elementgroupList)[source]¶ A utility function to find vertices with the same cluster memberships.
- Parameters
clustervs (igraph.vertex) – an igraph vertex instance
elementgroupList (list) – a list containing the vertices to group
- Returns
a list-of-lists containing the groupings of the vertices
-
clusim.clusimelement.numerical_ppr_scores(cielg, clustering, alpha=0.9, relabeled_elements=None, ppr_implementation='prpack')[source]¶ The element-centric clustering similarity affinity matrix for a partition.
- Parameters
cielg (igraph.WeightedGraph) – An igraph weighted graph representation of the cluster-induced element graph
clustering (Clustering) – The Clustering
alpha (float) – The personalized page-rank return probability as a float in [0,1].
relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
ppr_implementation (str) – (optional) Choose an implementation for the personalized page-rank calculation. ‘prpack’: use the PPR algorithm in igraph; ‘power_iteration’: use the power iteration method.
- Returns
The element-centric affinity representation of the clustering as a 2d numpy array
-
clusim.clusimelement.calculate_ppr_with_power_iteration(W_matrix, index, alpha=0.9, repetition=1000, th=0.0001)[source]¶ Implementation of personalized page-rank with power iteration. It is roughly 20 times faster than the implementation in igraph’s “prpack” on large networks.
- Parameters
W_matrix (scipy.csr_matrix) – The transition matrix of the given network
index (int) – The index of the target node
alpha (float) – The personalized page-rank return probability as a float in [0,1].
repetition (int) – (optional) The maximum number of iterations for calculating the personalized page-rank
th (float) – (optional) The calculation stops when ||p_i+1 - p_i||∞ falls below th
- Returns
The personalized page-rank scores for the target node as a 1d numpy array
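The power-iteration scheme described above can be sketched as follows, using a dense transition matrix for simplicity. The update convention here, with alpha as the probability of continuing the walk and (1 - alpha) as the probability of restarting at the target node, is an assumption about the parameterization; the helper name `ppr_power_iteration_sketch` is hypothetical and this is not CluSim's sparse implementation.

```python
import numpy as np

def ppr_power_iteration_sketch(W, index, alpha=0.9, repetition=1000, th=1e-4):
    """Personalized PageRank by power iteration.
    W is a row-stochastic transition matrix (dense numpy array);
    the walk restarts at `index` with probability (1 - alpha)."""
    n = W.shape[0]
    restart = np.zeros(n)
    restart[index] = 1.0
    p = restart.copy()
    for _ in range(repetition):
        p_next = alpha * (p @ W) + (1 - alpha) * restart
        # stop when successive iterates agree to within th in the infinity norm
        if np.abs(p_next - p).max() < th:
            return p_next
        p = p_next
    return p
```

On a two-node cycle the fixed point can be checked by hand: with alpha = 0.9 and restarts at node 0, the scores solve p0 = 0.9 p1 + 0.1 and p1 = 0.9 p0, giving p0 = 0.1/0.19.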
References¶
- ABL10
Yong-Yeol Ahn, James P Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, June 2010.
- CD88
Linda M. Collins and Clyde W. Dent. Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivariate Behavioral Research, 23(2):231–242, 1988.
- DDiazGDA05
Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008–P09008, September 2005.
- FM83
Edward B. Fowlkes and Colin L. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.
- GA17
Alexander J. Gates and Yong-Yeol Ahn. The impact of random models on clustering similarity. Journal of Machine Learning Research, 18(87):1–28, 2017.
- GWHA19
Alexander J. Gates, Ian B. Wood, William P. Hetrick, and Yong-Yeol Ahn. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific Reports, 2019.
- HA85
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
- Jac12
Paul Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.
- LFK09
Andrea Lancichinetti, Santo Fortunato, and János Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, March 2009.
- Mei03
Marina Meilă. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, pages 173–187. Springer, 2003.
- NYP12
Tamás Nepusz, Haiyuan Yu, and Alberto Paccanaro. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods, 9(5):471–472, 2012.
- NCY19
M. E. J. Newman, George T. Cantwell, and Jean-Gabriel Young. Improved mutual information measure for classification and community detection. arXiv:1907.12581, 2019.
- Ran71
William M Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846, 1971.