A Review of Data and Document Clustering pertaining to various Distance Measures
DOI:
https://doi.org/10.56294/saludcyt2022194
Keywords:
Machine Learning, Data Mining, Distance Measure, Similarity Measure and Clustering
Abstract
Data are being generated at an increasing rate in a variety of fields as science and technology advance, and the generated data are saved for future decision-making. Data mining is the process of extracting patterns and useful information from massive amounts of data. One of its basic instruments is the distance measure, which quantifies how different two objects are from one another. We have conducted a comprehensive survey of how distance measures behave when employed with different algorithms. Furthermore, we investigate the effectiveness and performance of some novel similarity measures proposed by other authors.
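As a brief illustration of why the choice of distance measure matters, the following minimal sketch compares three widely used measures on a single pair of vectors. The abstract does not name specific measures, so the choice of Euclidean, Manhattan and cosine distance, together with the function names and sample vectors, is an assumption made for illustration only.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical sample data: q points in the same direction as p but is twice as long.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]

print("Euclidean:", euclidean(p, q))
print("Manhattan:", manhattan(p, q))
print("Cosine distance:", cosine_distance(p, q))
```

The two magnitude-sensitive measures report the vectors as clearly separated, whereas the cosine distance is zero up to rounding because the vectors share a direction. A clustering algorithm can therefore group the same objects differently depending on the measure it is given.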
License
Copyright (c) 2022 Sumathi Subbarayan, Hannah Grace Gunaseelan (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
The article is distributed under the Creative Commons Attribution 4.0 License. Unless otherwise stated, associated published material is distributed under the same licence.