A Review of Data and Document Clustering pertaining to various Distance Measures
DOI:
https://doi.org/10.56294/saludcyt2022194
Keywords:
Machine Learning, Data Mining, Distance Measure, Similarity Measure and Clustering
Abstract
As science and technology advance, data are being generated at an increasing rate in a variety of fields and saved for future decision-making. Data mining is the process of extracting patterns and useful information from massive amounts of data. One instrument it relies on is the distance measure, which quantifies how different two objects are from one another. We have conducted a comprehensive survey of how distance measures behave when employed with different algorithms. Furthermore, we investigate the effectiveness and performance of some novel similarity measures proposed by other authors.
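To make the notion of a distance measure concrete, the following minimal Python sketch computes three widely used measures between two numeric vectors. The abstract does not name specific measures, so the choice of Euclidean, Manhattan, and cosine distance here is an assumption made for illustration, as are the function names and the sample vectors.

```python
import math

# Illustrative helpers only; names and the choice of measures are assumptions,
# not taken from the article. Inputs are equal-length sequences of numbers.

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sample objects: q points in the same direction as p but is twice as long.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]

d_euclid = euclidean(p, q)
d_manhattan = manhattan(p, q)
d_cosine = cosine_distance(p, q)
```

For these two vectors the Euclidean and Manhattan distances are positive, while the cosine distance is approximately zero because the vectors differ only in magnitude. This is one simple way to see that different measures can disagree on how "different" the same pair of objects is, which is why the choice of measure matters for clustering.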
License
Copyright (c) 2022 Sumathi Subbarayan, Hannah Grace Gunaseelan (Author)
This work is licensed under a Creative Commons Attribution 4.0 International License.
The article is distributed under the Creative Commons Attribution 4.0 License. Unless otherwise stated, associated published material is distributed under the same licence.