A Review of Data and Document Clustering pertaining to various Distance Measures

Sumathi Subbarayan; Hannah Grace Gunaseelan

doi:10.56294/saludcyt2022194

Authors

Sumathi Subbarayan. Vellore Institute of Technology Chennai, School of Advanced Sciences, Department of Mathematics, Tamil Nadu, India. ORCID: https://orcid.org/0000-0003-1762-1703
Hannah Grace Gunaseelan. Vellore Institute of Technology Chennai, School of Advanced Sciences, Department of Mathematics, Tamil Nadu, India. ORCID: https://orcid.org/0000-0001-9923-3709

DOI:

https://doi.org/10.56294/saludcyt2022194

Keywords:

Machine Learning, Data Mining, Distance Measure, Similarity Measure, Clustering

Abstract

Data are being generated at an increasing rate in a variety of fields as science and technology advance, and the generated data are saved for future decision-making. Data mining is the process of extracting patterns and useful information from massive amounts of data. One instrument used in this process is the distance measure, which calculates how different two objects are from one another. We present a comprehensive survey of how distance measures behave when employed with different algorithms. Furthermore, we investigate the effectiveness and performance of some novel similarity measures proposed by other authors.
