DenMune: A densitypeak clustering algorithm
DenMune a clustering algorithm that can find clusters of arbitrary size, shapes and densities in twodimensions. Higher dimensions are first reduced to 2D using the tsne. The algorithm relies on a single parameter K (the number of nearest neighbors). The results show the superiority of the algorithm. Enjoy the simplicity but the power of DenMune.
Scientific Work
Paper  Journal  Data 

Coding & Maintenance
Git Repo  Code Style  Installation  CI Workflow  Code Coverage 

Docs & Tutorials
Read the Docs  Repo2Docker  Colab  kaggle  ReviewNB 

Downloads Stats
downloads/day  download/week  download/month 

Suggestions & Reporting Issues
Github  Gitter  Slack 

Based on the paper
Paper 

Mohamed Abbas, Adel ElZoghabi, Amin Shoukry, 
DenMune: Density peak based clustering using mutual nearest neighbors 
In: Journal of Pattern Recognition, Elsevier, 
volume 109, number 107589, January 2021 
DOI: https://doi.org/10.1016/j.patcog.2020.107589 
Documentation:
Documentation, including tutorials, are available on https://denmune.readthedocs.io
[![read the documentation](https://img.shields.io/badge/read_thedocsorange)](https://denmune.readthedocs.io/en/latest/?badge=latest)
Watch it in action
This 30 seconds will tell you how a densitybased algorithm, DenMune propagates:
When less means more
Most calssic clustering algorithms fail in detecting complex clusters where clusters are of different size, shape, density, and being exist in noisy data. Recently, a densitybased algorithm named DenMune showed great ability in detecting complex shapes even in noisy data. it can detect number of clusters automatically, detect both preidentifiednoise and postidentifiednoise automatically and removing them.
It can achieve accuracy reach 100% in some classic pattern problems, achieve 97% in MNIST dataset. A great advantage of this algorithm is being singleparameter algorithm. All you need is to set number of knearest neighbor and the algorithm will care about the rest. Being Nonsensitive to changes in k, make it robust and stable.
Keep in mind, the algorithm reduce any ND dataset to only 2D dataset initially, so it is a good benefit of this algorithm is being always to plot your data and explore it which make this algorithm a good candidate for data exploration. Finally, the algorithm comes with neat package for visualizing data, validating it and analyze the whole clustering process.
How to install DenMune
Simply install DenMune clustering algorithm using pip command from the official Python repository
From the shell run the command
shell
pip install denmune
From Jupyter notebook cell run the command
ipython3
!pip install denmune
How to use DenMune
Once DenMune is installed, you just need to import it
python
from denmune import DenMune
###### Please note that first denmune (the package) in small letters, while the other one(the class itself) has D and M in capital case.
Read data
There are four possible cases of data:  only train data without labels  only labeled train data  labeled train data in addition to test data without labels  labeled train data in addition to labeled test data
```python #============================================= # First scenario: train data without labels # ============================================
data_path = 'datasets/denmune/chameleon/'
dataset = "t7.10k.csv"
data_file = data_path + dataset
# train data without labels X_train = pd.read_csv(data_file, sep=',', header=None)
knn = 39 # knearest neighbor, the only parameter required by the algorithm
dm = DenMune(train_data=X_train, k_nearest=knn) labels, validity = dm.fit_predict(show_analyzer=False, show_noise=True)
``` This is an intuitive dataset which has no groundtruth provided
```python #============================================= # Second scenario: train data with labels # ============================================
data_path = 'datasets/denmune/shapes/'
dataset = "aggregation.csv"
data_file = data_path + dataset
# train data with labels X_train = pd.read_csv(data_file, sep=',', header=None) y_train = X_train.iloc[:, 1] X_train = X_train.drop(X_train.columns[1], axis=1)
knn = 6 # knearest neighbor, the only parameter required by the algorithm
dm = DenMune(train_data=X_train, train_truth= y_train, k_nearest=knn) labels, validity = dm.fit_predict(show_analyzer=False, show_noise=True) ``` Datset groundtruth
Dataset as detected by DenMune at k=6
```python #================================================================= # Third scenario: train data with labels in addition to test data # ================================================================
data_path = 'datasets/denmune/pendigits/'
file_2d = data_path + 'pendigits2d.csv'
# train data with labels X_train = pd.read_csv(data_path + 'train.csv', sep=',', header=None) y_train = X_train.iloc[:, 1] X_train = X_train.drop(X_train.columns[1], axis=1)
# test data without labels X_test = pd.read_csv(data_path + 'test.csv', sep=',', header=None) X_test = X_test.drop(X_test.columns[1], axis=1)
knn = 50 # knearest neighbor, the only parameter required by the algorithm
dm = DenMune(train_data=X_train, train_truth= y_train, test_data= X_test, k_nearest=knn) labels, validity = dm.fit_predict(show_analyzer=True, show_noise=True) ``` dataset groundtruth
dataset as detected by DenMune at k=50
test data as predicted by DenMune on training the dataset at k=50
Algorithm's Parameters
 Parameters used within the initialization of the DenMune class
python
def __init__ (self,
train_data=None, test_data=None,
train_truth=None, test_truth=None,
file_2d =None, k_nearest=None,
rgn_tsne=False, prop_step=0,
):

train_data:
 data used for training the algorithm
 default: None. It should be provided by the use, otherwise an error will raise.

train_truth:
 labels of training data
 default: None

test_data:
 data used for testing the algorithm

test_truth:
 labels of testing data
 default: None

k_nearest:
 number of nearest neighbor
 default: 0. the default is invalid. knearest neighbor should be at least 1.

rgn_tsn:
 when set to True: It will regenerate the reduced 2D version of the ND dataset each time the algorithm run.
 when set to False: It will generate the reduced 2D version of the ND dataset first time only, then will reuse the saved exist file
 default: True

file_2d: name (include location) of file used save/load the reduced 2d version
 if empty: the algorithm will create temporary file named '_temp_2d'
 default: None

prop_step:
 size of increment used in showing the clustering propagation.
 leave this parameter set to 0, the default value, unless you are willing intentionally to enter the propagation mode.
 default: 0

Parameters used within the fit_predict function:
python
def fit_predict(self,
validate=True,
show_plots=True,
show_noise=True,
show_analyzer=True
):

validate:
 validate data on/off according to five measures integrated with DenMune (Accuracy. F1score, NMI index, AMI index, ARI index)
 default: True

show_plots:
 show/hide plotting of data
 default: True

show_noise:
 show/hide noise and outlier
 default: True

show_analyzer:
 show/hide the analyzer
 default: True
The Analyzer
The algorithm provide an exploratory tool called analyzer, once called it will provide you with indepth analysis on how your clustering results perform.
Noise Detection
DenMune detects noise and outlier automatically, no need to any further work from your side.
 It plots preidentified noise in black
 It plots postidentified noise in light grey
You can set show_noise parameter to False.
```python
# let us show noise
m = DenMune(train_data=X_train, k_nearest=knn) labels, validity = dm.fit_predict(show_noise=True) ```
```python
# let us show clean data by removing noise
m = DenMune(train_data=X_train, k_nearest=knn) labels, validity = dm.fit_predict(show_noise=False) ```
noisy data  clean data 

Validatation
You can get your validation results using 3 methods
 by showing the Analyzer
 extract values from the validity returned list from fit_predict function

extract values from the Analyzer dictionary  There are five validity measures builtin the algorithm, which are:

ACC, Accuracy
 F1 score
 NMI index (Normalized Mutual Information)
 AMI index (Adjusted Mutual Information)
 ARI index (Adjusted Rand Index)
Knearest Evolution
The following chart shows the evolution of pre and post identified noise in correspondence to increase of number of knn. Also, detected number of clusters is analyzed in the same chart in relation with both types of identified noise.
The Scalability
data size  time 

data size: 5000  time: 2.3139 seconds 
data size: 10000  time: 5.8752 seconds 
data size: 15000  time: 12.4535 seconds 
data size: 20000  time: 18.8466 seconds 
data size: 25000  time: 28.992 seconds 
data size: 30000  time: 39.3166 seconds 
data size: 35000  time: 39.4842 seconds 
data size: 40000  time: 63.7649 seconds 
data size: 45000  time: 73.6828 seconds 
data size: 50000  time: 86.9194 seconds 
data size: 55000  time: 90.1077 seconds 
data size: 60000  time: 125.0228 seconds 
data size: 65000  time: 149.1858 seconds 
data size: 70000  time: 177.4184 seconds 
data size: 75000  time: 204.0712 seconds 
data size: 80000  time: 220.502 seconds 
data size: 85000  time: 251.7625 seconds 
data size: 100000  time: 257.563 seconds 

The Stability
The algorithm is only singleparameter, even more it not sensitive to changes in that parameter, k. You may guess that from the following chart yourself. This is of great benefit for you as a data exploration analyst. You can simply explore the dataset using an arbitrary k. Being Nonsensitive to changes in k, make it robust and stable.
Reveal the propagation
one of the top performing feature in this algorithm is enabling you to watch how your clusters propagate to construct the final output clusters. just use the parameter 'prop_step' as in the following example:
```python dataset = "t7.10k" # data_path = 'datasets/denmune/chameleon/'
# train file data_file = data_path + dataset +'.csv' X_train = pd.read_csv(data_file, sep=',', header=None)
from itertools import chain
# Denmune's Paramaters knn = 39 # number of knearest neighbor, the only parameter required by the algorithm
# create list of differnt snapshots of the propagation snapshots = chain(range(2,5), range(5,50,10), range(50, 100, 25), range(100,500,100), range(500,2000, 250), range(1000,5500, 500))
from IPython.display import clear_output
for snapshot in snapshots:
print ("itration", snapshot )
clear_output(wait=True)
dm = DenMune(train_data=X_train, k_nearest=knn, rgn_tsne=False, prop_step=snapshot)
labels, validity = dm.fit_predict(show_analyzer=False, show_noise=False)
```
Interact with the algorithm
This notebook allows you interact with the algorithm in many aspects:  you can choose which dataset to cluster (among 4 chameleon datasets)  you can decide which number of knearest neighbor to use  show noise on/off; thus you can invesetigate noise detected by the algorithm  show analyzer on/off
How to run and test

Launch Examples in Repo2Docker Binder
Simply use our repo2docker offered by mybinder.org, which encapsulate the algorithm and all required data in one virtual machine instance. All Jupyter notebooks examples found in this repository will be also available to you in action to practice in this respo2docer. Thanks mybinder.org, you made it possible!
2. Launch each Example in Google Research, CoLab
Need to test examples one by one, then here another option. Use colab offered by google research to test each example individually.
Here is a list of Google CoLab URL to use the algorithm interactively

3. Launch each Example in Kaggle workspace
If you are a kaggler like me, then Kaggle, the best workspace where data scientist meet, should fit you to test the algorithm with great experience.
 Dataset  Kaggle URL 
    
 When less means more  kaggle  [![When less means more  kaggle](https://kaggle.com/static/images/openinkaggle.svg)]( https://www.kaggle.com/egyfirst/whenlessmeansmore) 
 Nongroundtruth datasets  kaggle  [![Nongroundtruth datasets](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/detectingnongroundtruthdatasets) 
 2D Shape datasets  kaggle  [![2D Shape datasets  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/detectionof2dshapedatasets) 
 MNIST dataset kaggle  [![MNIST dataset  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/get97usingsimpleyetoneparameteralgorithm) 
 Iris dataset kaggle  [![iris dataset  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/denmuneclusteringirisdataset) 
 Training MNIST to get 97%  [![Training MNIST to get 97%](https://kaggle.com/static/images/openinkaggle.svg)]( https://www.kaggle.com/egyfirst/trainingmnistdatasettoget97) 
 Noise detection  kaggle  [![Noise detection  kaggle](https://kaggle.com/static/images/openinkaggle.svg)]( https://www.kaggle.com/egyfirst/noisedetection) 
 Validation  kaggle  [![Validation  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/validatein5builtinvalidityinsexes) 
 The beauty of propagation  kaggle  [![The beauty of propagation  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/thebeautyofclusterspropagation) 
 The beauty of propagation part2  kaggle  [![The beauty of propagation part 2  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/thebeautyofpropagationpart2) 
 Snapshots of propagation kaggle  [![The beauty of propagation  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/beautyofpropagationpart3) 
 Scalability kaggle  [![Scalability  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/scalabilityvsspeed) 
 Stability  kaggle  [![Stability  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/stabilityvsnumberofnearestneighbor) 
 knearestevolution  kaggle  [![knearestevolution  kaggle](https://kaggle.com/static/images/openinkaggle.svg)](https://www.kaggle.com/egyfirst/knearestevolution) 
How to cite ===== If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Pattern Recognition article
Mohamed Abbas McInnes, Adel ElZoghaby, Amin Ahoukry, *DenMune: Density peak based clustering using mutual nearest neighbors*
In: Journal of Pattern Recognition, Elsevier, volume 109, number 107589.
January 2021
bib
@article{ABBAS2021107589,
title = {DenMune: Density peak based clustering using mutual nearest neighbors},
journal = {Pattern Recognition},
volume = {109},
pages = {107589},
year = {2021},
issn = {00313203},
doi = {https://doi.org/10.1016/j.patcog.2020.107589},
url = {https://www.sciencedirect.com/science/article/pii/S0031320320303927},
author = {Mohamed Abbas and Adel ElZoghabi and Amin Shoukry},
keywords = {Clustering, Mutual neighbors, Dimensionality reduction, Arbitrary shapes, Pattern recognition, Nearest neighbors, Density peak},
abstract = {Many clustering algorithms fail when clusters are of arbitrary shapes, of varying densities, or the data classes are unbalanced and close to each other, even in two dimensions. A novel clustering algorithm “DenMune” is presented to meet this challenge. It is based on identifying dense regions using mutual nearest neighborhoods of size K, where K is the only parameter required from the user, besides obeying the mutual nearest neighbor consistency principle. The algorithm is stable for a wide range of values of K. Moreover, it is able to automatically detect and remove noise from the clustering process as well as detecting the target clusters. It produces robust results on various low and high dimensional datasets relative to several known state of the art clustering algorithms.}
}
Licensing
The DenMune algorithm is 3clause BSD licensed. Enjoy.
Task List
 [x] Update Github with the DenMune sourcode
 [x] create repo2docker repository
 [x] Create pip Package
 [x] create CoLab shared examples
 [x] create documentation
 [x] create Kaggle shared examples
 [x] PEP8 compliant
 [x] Continuous integration
 [x] scikitlearn compatible
 [x] creating unit tests (coverage: 100%)
 [x] generating API documentation
 [ ] create conda package