REVIEW
Big Data and Different Subspace Clustering Approaches: From social media promotion to genome mapping
Big Data y diferentes enfoques de clustering subespacial: De la promoción en redes sociales al mapeo genómico
Vijaya Kishore Veparala1 *,
Vattikunta Kalpana1
*
1Department of ECE, Mohan Babu University, Tirupati, A.P, India.
Cite as: Kishore Veparala V, Kalpana V. Big Data and Different Subspace Clustering Approaches: From social media promotion to genome mapping. Salud, Ciencia y Tecnología. 2023;3:413. https://doi.org/10.56294/saludcyt2023413
Received: 27-04-2023 Reviewed: 13-05-2023 Accepted: 19-06-2023 Published: 20-06-2023
ABSTRACT
In the present age of information technology, information is the most important factor in determining how different paradigms progress. This information must be mined out of a massive computational treasure trove. The rise in the amount of data being analyzed and interpreted is a direct result of the proliferation of more powerful processing platforms, the increase in the amount of storage space available, and the transition toward the use of electronic platforms. This work presents a thorough study of Big Data, its characteristics, and the role that subspace clustering algorithms play in its analysis. The most important contribution of this paper is that it surveys a large body of previous research and gives a thorough account of the different ways in which other authors have classified subspace clustering methods. In addition, significant algorithms that can act as a benchmark for any future development are described briefly.
Keywords: Big Data; Clustering; Subspace; Classification; Integrative Review.
RESUMEN
En la era actual de las tecnologías de la información, la información es el factor más importante para determinar cómo progresarán los distintos paradigmas. Esta información debe extraerse de un enorme tesoro informático. El aumento de la cantidad de datos analizados e interpretados es consecuencia directa de la proliferación de plataformas de procesamiento más potentes, el incremento del espacio de almacenamiento disponible y la transición hacia el uso de plataformas electrónicas. En este trabajo se describe un estudio exhaustivo de Big Data, sus características y el papel que desempeña el algoritmo de clustering Subspace. La contribución más importante que hace este trabajo es que lee muchas investigaciones anteriores y luego hace una presentación exhaustiva sobre las diferentes formas en que otros autores han clasificado los métodos de clustering subespacial. Además, se han proporcionado, con una breve explicación, algoritmos significativos que pueden servir de referencia para cualquier desarrollo futuro.
Palabras clave: Big Data; Clustering; Subespacio; Clasificación; Revisión Integradora.
INTRODUCTION
In this era of constant communication and computing, data is widely recognized as one of the most valuable forms of property. It can be described as a collection of variables and values that, in some instances, are similar to one another to a certain degree and, in other cases, are dissimilar to a certain degree. The size of databases has grown significantly in tandem with the exponential growth in the capacity of recording devices, and the spread of computing platforms in the form of smartphones has resulted in the incidental collection and storage of a vast treasury of data. Because the information contained in this data can be useful, there are now more instruments available than ever before that are able to rapidly extract useful information from large amounts of data.(1) One definition of a database describes it as an organized collection of data that is simple to manage, acquire, and keep updated, while data mining is defined as the process of finding information that is relevant and useful to the user. This includes the ability to extract, from large quantities of data, details about related patterns, anomalies, and important structures that are stored in computerized formats such as data warehouses, databases, and other types of information archives. Therefore, data mining (DM), also referred to as Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is a method of automatically searching enormous amounts of data for structures such as association rules.(2) It employs a large number of different computational methods, including pattern identification, information extraction, machine learning, and statistics, among others. The primary objective of data mining is to quickly select only the essential patterns from a database while minimizing the amount of time spent doing so. Data mining activities can be broken down into the categories of summarization, classification, clustering, association, and trend analysis,(2) varying according to the kinds of patterns to be extracted.
Owing to the fact that the quantity of data is growing at an exponential pace, improved evaluation is needed to extract the data that most closely corresponds to the interests of individual users. Enormous amounts of data are created every minute, indeed every second. In this context, the term "big data" refers to datasets that expand at an alarming rate and whose size exceeds the capacity of the traditional database tools used to store, organize, and analyze them. The phrase also covers compilations of data that contain both structured and unstructured information. Some of the factors behind the tremendous development of big data include the accessibility of information, the rise in storage capacity, and the rapid rise in the processing capacity of computing platforms. Big data also refers to the practice of utilizing large data sets in order to efficiently handle the gathering or publishing of information that assists businesses or individuals in making decisions. The information may be specific to the business or general, and it may be private or open to the public.(3)
The process of searching through enormous amounts of data for information that is relevant to a particular question or problem, and then collecting that information, is known as big data mining. Big data samples are used in many areas, including astronomy, atmospheric science, social networking sites, life sciences, medical science, government data, natural disasters and resource management, web logs, mobile phones, sensor networks, scientific research, telecommunications, and more.(4) In the endeavor to analyze and understand big data, clustering is one of the most important tools available. Research has paid a great deal of attention to the arrangement of data because of the enormous number of applications it has in summarization, learning, segmentation, and targeted marketing.(5,6,7) Clustering can be thought of as a simplified model of the data when no additional information is given; depending on the situation, this model can be seen either as a summary or as a way to generate new hypotheses. Given a set of data points, the main task of clustering is to "split them up into groups that are as similar to each other as possible".(7)
It is imperative to modify preexisting algorithms to maintain the quality and speed of clustering in light of the emergence of big data and the fact that datasets are growing both in size and in variety. Traditional clustering algorithms examine and consider every dimension of a data collection, in order to increase the likelihood of gleaning as much information as possible from each object. However, in high-dimensional data, many of the dimensions are frequently irrelevant to the problem. These irrelevant dimensions can conceal clusters in noisy data and cause clustering algorithms to produce incorrect clusters. When there are many variables in a dataset, it is not uncommon for all the objects to be located at nearly the same distance from one another, which completely masks the clusters. Feature selection methods have been successfully implemented, to a certain degree, to improve the overall quality of the clusters. The subspace clustering approach(6) is a very effective way to group the complex information that is often found in big data. In contrast to feature selection methods, which analyze the data in its entirety, subspace clustering algorithms narrow their focus to specific regions of the feature space. As a direct consequence, these algorithms can identify clusters that exist in a great variety of subspaces, some of which may intersect with one another.
The term "big data" can have a variety of meanings, and the author of this work investigates those meanings as well as the challenges that are associated with big data analysis. This contains the characteristics that are employed in defining "Big Data," as well as its significance and those characteristics. The primary purpose of this study is to cast light on various kinds of Subspace clustering techniques that are already present in the existing body of research. In the same way that there are multiple descriptions for "Big data," there are numerous classifications of "Subspace clustering algorithm," as is generally agreed upon in the academic research community. This article delves into the various classification strategies that have been utilized by a variety of researchers over the course of some time to classify and categorize different subspace clustering strategies. This level of comprehension is necessary for the development of any new strategies that might be of assistance in the examination and interpretation of big data. A concise description of significant Subspace algorithms, which can act as a benchmark for any further development of algorithms, is also provided in this article. After an introduction in the first section, a discussion of the various classifications of big data is presented in the second section, which is followed by a discussion of the difficulties presented by high-dimensional data clustering in the third section. Section 4 provides an explanation of a literature review pertaining to the classification of subspace clustering approaches, and section 5 provides a conversation about significant existing approaches. The references that were used are presented in section 6, while the conclusions themselves are presented in section 7.
Development
Definition of big data
Big data can be defined in a variety of ways, and the first quality that comes to mind when asking "what is big data?" is, without a doubt, its magnitude. However, other aspects of big data have emerged over the past few years. According to Laney(8), the three dimensions that need to be dealt with in data administration are volume, variety, and velocity (also known as the three V's). Big data can now be defined by this structure, referred to as the "Three V's".(9,10) Many of the world's most prestigious institutions and businesses have been instrumental in determining the characteristics of big data. Gartner, Inc. defines big data as "high-volume, high-velocity, and high-variety data assets that call for cost-effective, innovative forms of processing of data for enhanced insight and decision making".(12) The TechAmerica Foundation gave a similar definition: "Big data is a term for large amounts of fast-moving, complex, and different data that require modern techniques and methods to capture, store, distribute, manage, and analyze".(11) Based on these definitions, it is reasonable to conclude that the three most important aspects of big data are volume, variety, and velocity. These three essential characteristics describe and establish the direction of the various approaches that can be taken to analyze large amounts of data.
The quantity of material is referred to as its volume. The sizes of big data collections are typically reported in multiple terabytes and petabytes; one petabyte equals 1024 terabytes. There is no universally recognized definition of big data volumes; rather, such definitions are relative and depend on a variety of variables, including the time and the kind of data. What counts as "big data" today may not meet the requirements in the years to come, since storage capacity will inevitably increase and make it possible to record even more expansive data collections. In addition, it is necessary to have a solid comprehension of the role that variety plays when determining the overall size of a data set. The nature of the data introduces a fresh obstacle when attempting to specify its quantity: the question of whether a data collection constitutes a "big" or "small" one arises whenever two datasets of comparable size call for management strategies that could not be more dissimilar from one another. As mentioned earlier, a substantial amount of the progress toward the development of big data was made by businesses, and the kind of business also plays a substantial role in determining the benchmarks for the volume of big data.
The different kinds of structures found in a dataset are referred to collectively as the "variety" of that dataset. The advancement of technology has made it possible to use a broad variety of data formats, including structured, semi-structured, and unstructured data, and to store and retrieve data more efficiently. A good example of structured data is the tabulated information found in relational databases or spreadsheets; structured data accounts for only 5 % of all the data currently available.(13) Textual content, pictures, audio, and video are all instances of the types of data that make up the majority of unstructured data; such data often lack the structural organization that is necessary for analysis by automated machines. Semi-structured data do not comply with any strict standard. A common illustration of semi-structured data is the Extensible Markup Language (XML), a textual language used for the exchange of data on the web. Documents written in XML carry user-defined data tags, which enables machines to interpret them.
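To illustrate how user-defined tags make semi-structured data machine-readable, the following minimal Python sketch parses a hypothetical XML record with the standard library; the record and its tag names are invented for illustration only.

```python
# Minimal sketch: parsing a hypothetical semi-structured XML record.
# Tag names ("customer", "name", "purchase", "note") are illustrative only.
import xml.etree.ElementTree as ET

record = """
<customer id="42">
    <name>Jane Doe</name>
    <purchase amount="19.99" currency="USD"/>
    <note>free-text comment, no fixed schema required</note>
</customer>
"""

root = ET.fromstring(record)
print(root.tag, root.attrib["id"])      # customer 42
print(root.find("name").text)           # Jane Doe
print(root.find("purchase").attrib)     # {'amount': '19.99', 'currency': 'USD'}
```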
The term "velocity" is used in an idiomatic sense to refer to rate at which data are formed as well as the swiftness with which they must be processed, evaluated, and interpreted. In conjunction with the proliferation of digital devices, the requirement for real-time analytics and situation-based planning has resulted in a rate of data generation that has never been seen before. Even the most traditional shopping establishments are producing high-frequency data. For example, Wal-Mart is capable of handling nearly one million sales every single hour.(13) The torrents of information that are produced by the information that comes from smartphones and mobile apps is currently being used to make real-time, personalized offers for users as they go about their daily lives.
In the research done so far, other aspects of big data besides the three V's have also been discussed. These terms were developed by titans of the information technology industry, who play a pivotal role in describing the generation, need, and use of big data. IBM introduced the fourth "V", veracity, which describes the intrinsic unreliability of certain data sources. It is meant to capture the unpredictability that comes with natural decision-making processes, as well as other factors such as human judgment. There is a high probability that this kind of unpredictability will be present in the vast majority of the data produced by social media. Dealing with uncertain and questionable data is important because, even though they hold valuable information, they can be wrong in a number of ways. This is another aspect of big data that needs to be dealt with, and it can be handled with the help of tools and analytics designed specifically for managing and mining uncertain data.
The company SAS introduced variability and complexity as two additional characteristics of big data. Variability refers to the variation in the rates at which data flow in, as it is frequently observed that the velocity of big data is not constant but rather exhibits periodic peaks and valleys. Complexity describes the fact that big data are generated by a large number of different sources; this is a significant challenge because it means that data from different sources need to be linked, matched, cleaned, and transformed. Value, introduced by Oracle, is also frequently cited as one of the distinguishing characteristics. According to Oracle's definition, big data sets are typically characterized by a "low value density": the worth of the data in its original form is typically quite low in comparison with the volume it occupies. On the other hand, a high value can be obtained by analyzing large amounts of such data.
High dimensional clustering challenges
Clustering high-dimensional data is difficult for a number of reasons, the most significant of which are as follows:
Curse of Dimensionality or Sparse Data: when machine learning algorithms are applied to data of high dimensionality, a phenomenon referred to as the curse of dimensionality occurs.(6) The curse of dimensionality refers to the inability of clustering algorithms to deal successfully with data sets that contain a large number of dimensions. As figure 1 illustrates, the number of regions increases at an exponential rate as the number of dimensions increases.
Figure 1. Curse of Dimensionality(14)
This can be understood by considering the fact that the significance of distance measures between data points decreases as the dimension of the dataset increases. It can be attributed to the spread of the data that occurs as dimensions are added, which eventually reaches a point where, in very high dimensions, the data are nearly equidistant from one another. The issue becomes even more complicated when the objects in question are related to one another in a variety of different subsets of dimensions. Subspace clustering algorithms are designed to address and discover this kind of relationship. It is essential to remove any superfluous characteristics in order to guarantee that the clustering method can concentrate solely on the important dimensions. Clusters located in low-dimensional spaces also tend to be easier to interpret, which enables the user to direct the path of subsequent investigation more effectively.
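The distance concentration described above can be observed directly. The following short Python sketch, a minimal illustration rather than an experiment from the reviewed literature, draws uniform random points and reports how the ratio between the largest and smallest pairwise distances shrinks as the number of dimensions grows.

```python
# Minimal sketch of the distance-concentration effect: as dimensionality grows,
# the nearest and farthest pairwise distances of uniform random points converge.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200  # number of points

for d in (2, 10, 100, 1000):
    X = rng.random((n, d))   # uniform points in the d-dimensional unit cube
    dist = pdist(X)          # all pairwise Euclidean distances
    ratio = dist.max() / dist.min()
    print(f"d={d:5d}  min={dist.min():.3f}  max={dist.max():.3f}  max/min={ratio:.2f}")
```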
Figure 2. Classifier performance vs. Dimensionality(14)
In a similar fashion, the effectiveness of a classifier rises along with the dimensionality of the data available to it, up until the point where the optimal number of features is attained; figure 2 depicts this behaviour. Any subsequent rise in dimensionality that is not accompanied by an increase in the number of training samples will result in a decline in the performance of the classifier. This is another essential facet that must be taken into consideration in the examination of big data.
Irrelevant Dimensions: while working with high-dimensional data, it is quite normal for a significant number of the dimensions to be unnecessary for the clustering or analysis of the data.(15) In situations like this, one of the most common solutions is to reduce the dimensionality of the data while maintaining the integrity of the essential information. As a result, the clustering process generally begins with a phase known as feature selection, which endeavors to strip the data of characteristics that are not important to its interpretation. However, because the clusters are embedded in a variety of subspaces, the global filtering approach to feature selection is not practical when working with high-dimensional data. When clustering the data, a dimension may be helpful in certain subspace formations while being completely meaningless in others.
Correlations among Dimensions: correlation between attributes is typically observed across a significant number of datasets. Therefore, the clusters may not be aligned parallel to the axes; instead, they can be arbitrarily oriented. Because of these issues, the data space has a relatively low average density. Not only is the data space sparsely populated, but the noise values are also spread evenly across the high-dimensional space.(16) When applied to high-dimensional data, conventional clustering methods therefore prove inefficient when searching for clusters.
Classification of subspace clustering approaches
The methodologies that can be used to cluster high-dimensional data include both feature transformation and feature selection techniques. Feature transformation approaches summarize a dataset by combining the original attributes in various ways so as to reduce the total number of dimensions used. Such methods are helpful in illuminating the hidden structure that exists within datasets. Because this approach preserves the distances between items, it works less well when there are many irrelevant attributes in the data; this is its primary drawback. Additionally, the new features are composites of the old ones, which can make it very difficult to interpret the new characteristics within the framework of the domain. Feature selection approaches, on the other hand, focus on the dimensions of a dataset that are most pertinent to revealing groups of objects that are comparable in only a selected subset of their properties. This approach works well with many different kinds of datasets; however, if the clusters are distributed across multiple subspaces, it may be difficult to identify them with this technique. Subspace algorithms are perfectly capable of processing this kind of input: they take the concepts of feature selection one step further by selecting pertinent subspaces for each cluster individually.
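The contrast between the two families can be made concrete with a small, hedged sketch. The example below uses scikit-learn's PCA as a representative feature transformation method (the dataset and parameter choices are illustrative): each derived component is a linear combination of all original attributes, which is precisely why such features preserve global structure but are difficult to interpret in terms of the original domain.

```python
# Minimal sketch of feature transformation with PCA: each new feature is a
# linear combination of *all* original attributes, so interpretability in
# terms of the original dimensions is lost.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))          # toy high-dimensional data

pca = PCA(n_components=3)
Z = pca.fit_transform(X)                # transformed, reduced representation

print(Z.shape)                          # (300, 3)
print(pca.components_.shape)            # (3, 20): every component mixes all 20 attributes
print(pca.explained_variance_ratio_)    # share of variance captured by each component
```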
The process of subspace clustering is an extension of the feature selection procedure that searches for groups within the numerous subspaces that make up a single dataset. Subspace clustering requires a search technique as well as evaluation criteria, just as feature selection does. In addition, the scope of the assessment criteria needs to be restricted in some way so that different subspaces can be considered for each of the distinct clusters. Existing subspace clustering methodologies can be categorized via a wide variety of schemes. Lance et al.(17) state that there are two main ways of clustering subspaces, top-down and bottom-up, distinguished by the direction in which the search is conducted. The top-down technique can be subdivided further into per-cluster weighting and per-instance weighting methods. The authors also used a grid-based classification strategy, in which the groups are further subclassified, depending on the dimension of the array, into static or adaptive grid strategies. It was not possible to categorize the bottom-up approaches with absolute certainty using either the grid-based or the density-based framework. Ilango et al.(18) provided a classification in which multidimensional clustering techniques were divided into the following categories: partitioning techniques, density-based techniques, hierarchical techniques, grid-based techniques, and model-based techniques.
Subspace clustering was divided into two categories by Karlton et al.(19): density-based clustering and projected clustering. Density-driven clustering techniques such as CLIQUE (CLustering In QUEst),(15) MAFIA (Merging of Adaptive Finite Intervals, and is more than a clique),(20) and SUBCLU (density-connected SUBspace CLUstering)(21) use the density of the data as the basis for their clustering. Projected clustering is represented by methods such as PROCLUS (PROjected CLUStering),(22) CLARANS,(23) ORCLUS (arbitrarily ORiented projected CLUStering),(24) DOC (Density-based Optimal projective Clustering),(25) etc.
Kriegel et al.(26) discussed clustering techniques for high-dimensional data from the perspective of the data itself: subspace clustering, which is predicated on axis-parallel clusters; correlation clustering, which is predicated on arbitrarily oriented clusters; and pattern-based clustering, which is predicated on the data exhibiting particular patterns. Finding clusters that can occur in arbitrarily oriented subspaces is the goal of the correlation-based methods, of which ORCLUS(24) is one example. p-Cluster(27) is a pattern-based clustering method that groups together objects that exhibit comparable tendencies on a subset of attributes.
The problem-based categorization of axis-parallel subspace clustering leads to classes such as hybrid algorithms, projected clustering, and soft projected clustering. PreDeCon (subspace PREference weighted Density CONnected clustering)(28) is an illustration of a projected clustering strategy; with this method, a unique assignment of each object to exactly one subspace cluster or to noise is found.(29) In soft projected clustering algorithms, the overall number of clusters, denoted by k, is chosen in advance, and an objective function is constructed and optimized so that the k clusters are formed as well as possible.(30) The algorithm COSA (Clustering Objects on Subsets of Attributes)(29) is a good illustration of a soft projected clustering algorithm. A different subspace clustering algorithm, SUBCLU,(21) works toward the goal of locating all subspaces that contain identifiable clusters.
FIRES (FIlter REfinement Subspace clustering) is an example of a hybrid algorithm, a class that refers to algorithms able to identify overlapping clusters.(26) In Müller et al.(32), an additional classification of subspace clustering techniques, which depends on how the output is parameterized, is described.
That work breaks the methodologies down into three categories: cell-based, density-based, and clustering-oriented approaches. CLIQUE(15) is an example of a cell-based strategy, which searches for collections of fixed or adaptive grid cells containing more than a certain threshold number of objects. Density-based approaches, such as SUBCLU,(21) define clusters as dense areas separated by sparse regions. Clustering-oriented approaches such as PROCLUS(22) define properties of the entire set of clusters, for example the number of clusters in the dataset or statistically oriented properties.
Important subspace clustering algorithms
In this section, we take a look at a few of the most important subspace clustering algorithms reported in the literature over the years. These algorithms have been significant contributors to the development of novel methodologies, and it is easy to see that most later algorithms are merely adjustments and modifications of one of these pre-existing methods.
CLIQUE(15) is one of the earliest algorithms to identify clusters contained within subspaces of a dataset, and it does so using an APRIORI-style method. In order to locate clusters, the algorithm utilizes a hybrid approach that combines density-based and grid-based clustering. The technique makes use of coverage, understood as the percentage of the raw data contained in the dense units of a subspace, in order to determine which clusters actually exist. After the dense subspaces have been identified, they are ranked according to their coverage; only the subspaces with the greatest coverage are retained, while the others are pruned. The algorithm then utilizes a depth-first search strategy to find neighboring dense grid units in all of the subspaces selected in the previous phase. Clusters are produced by combining these units following a greedy growth strategy: the process begins with a randomly selected dense unit and greedily expands a maximal region from there until it reaches the largest size that is feasible. This growth is carried out in every dimension until the whole cluster is covered by the union of the regions. The process is then repeated, and redundant smaller regions are eliminated one by one until no more maximal regions remain to be removed. The hyper-rectangular clusters are finally described by an expression in Disjunctive Normal Form (DNF).
CLIQUE is able to locate a wide variety of clusters, regardless of their shape, and can present the results in a manner that is simple to understand. Because its technique for generating clusters is based on growing dense regions, CLIQUE can locate clusters of arbitrary shape. Clusters may be found in the same subspace, in overlapping subspaces, or in disjoint subspaces. The DNF expressions used to represent clusters are typically very simple and straightforward to comprehend, which is useful because clusters frequently exist in various subspaces and therefore represent a variety of different relationships. CLIQUE, like other bottom-up algorithms, scales well with the number of instances and attributes in the collection. However, CLIQUE and similar methods do not scale well with the number of dimensions in the output clusters, since they were designed to handle a much smaller number of dimensions. Because subspace clustering is typically utilized to locate low-dimensional clusters in high-dimensional data, this is not usually a significant problem.
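To make the bottom-up, APRIORI-style step concrete, the following simplified sketch illustrates the idea rather than reproducing the published CLIQUE implementation: it partitions each dimension into a grid, finds dense one-dimensional units, and joins them into two-dimensional candidates that are kept only if they remain dense. The grid resolution xi and the density threshold tau are illustrative parameters.

```python
# Simplified sketch of CLIQUE-style dense-unit mining (illustrative only):
# 1) cut every dimension into `xi` equal intervals,
# 2) keep 1-d units covering more than a fraction `tau` of the points,
# 3) join dense 1-d units from different dimensions into 2-d candidates
#    and keep those that are still dense (APRIORI-style pruning).
from itertools import combinations
import numpy as np

def dense_units(X, xi=10, tau=0.05):
    n, d = X.shape
    # map every value to its grid cell index in [0, xi-1]; assumes X scaled to [0, 1]
    cells = np.clip((X * xi).astype(int), 0, xi - 1)

    # steps 1+2: dense 1-dimensional units, keyed by (dimension, cell index)
    dense_1d = {}
    for dim in range(d):
        counts = np.bincount(cells[:, dim], minlength=xi)
        for cell, c in enumerate(counts):
            if c / n > tau:
                dense_1d[(dim, cell)] = int(c)

    # step 3: candidate 2-d units from pairs of dense 1-d units in different dims
    dense_2d = {}
    for (d1, c1), (d2, c2) in combinations(dense_1d, 2):
        if d1 == d2:
            continue
        mask = (cells[:, d1] == c1) & (cells[:, d2] == c2)
        if mask.sum() / n > tau:
            dense_2d[((d1, c1), (d2, c2))] = int(mask.sum())
    return dense_1d, dense_2d

rng = np.random.default_rng(3)
X = rng.random((500, 4))                       # toy data in the unit hypercube
X[:250, 0] = 0.1 + 0.05 * rng.random(250)      # embed a cluster in dims 0 and 2
X[:250, 2] = 0.7 + 0.05 * rng.random(250)
u1, u2 = dense_units(X)
print(len(u1), "dense 1-d units;", len(u2), "dense 2-d units")
```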
ENCLUS(33) is a subspace clustering technique strongly influenced by CLIQUE. One of the most significant distinctions between the two is that ENCLUS does not directly measure density or coverage; instead, it measures the entropy of the data. The fundamental idea underlying ENCLUS is that, on average, a subspace that contains clusters will have lower entropy than one that does not. The clusterability of a subspace can be described with the help of three criteria, coverage, density, and correlation, and entropy is a useful tool for evaluating all three simultaneously. It is also worth keeping in mind that the entropy of a system diminishes as the density of its cells rises, although under particular circumstances entropy can also diminish in response to an increase in coverage. In a similar vein, interest is a measure of association, defined as the difference between the sum of the entropies of the individual dimensions and the entropy of their joint distribution. In other words, interest measures the strength of the correlation among a set of dimensions: higher values indicate a higher degree of correlation, whereas a value of zero indicates that the dimensions are independent. In order to mine significant subspaces, ENCLUS takes the same APRIORI-style bottom-up strategy as CLIQUE, and pruning is accomplished by combining the downward closure property of entropy with the upward closure property of interest in order to identify minimal correlated subspaces.
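A minimal sketch of the entropy-based criterion is given below. It is an illustration of the idea only, using a simple histogram estimator; the exact discretization and thresholds of the original ENCLUS differ. The sketch discretizes each dimension, estimates the entropy of a subspace from grid-cell frequencies, and computes interest as the difference between the sum of the single-dimension entropies and the joint entropy.

```python
# Illustrative sketch of entropy and "interest" for a subspace, in the spirit
# of ENCLUS: low joint entropy suggests clustering structure, and a large
# interest value suggests correlation among the chosen dimensions.
import numpy as np

def subspace_entropy(X, dims, bins=10):
    """Shannon entropy (bits) of the grid-cell distribution over the listed dims."""
    cells = np.floor(np.clip(X[:, dims], 0, 1 - 1e-12) * bins).astype(int)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def interest(X, dims, bins=10):
    """Sum of 1-d entropies minus the joint entropy of the subspace."""
    single = sum(subspace_entropy(X, [d], bins) for d in dims)
    return single - subspace_entropy(X, list(dims), bins)

rng = np.random.default_rng(4)
X = rng.random((1000, 3))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=1000)   # dim 1 correlated with dim 0

print(round(interest(X, (0, 1)), 3))   # clearly positive: correlated pair
print(round(interest(X, (0, 2)), 3))   # near zero: independent pair
```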
PROCLUS(22) was the first top-down subspace clustering algorithm. In a manner analogous to CLARANS(34), PROCLUS takes a random sample of the data, then chooses a set of k medoids and iteratively improves the clustering. The algorithm operates in three distinct phases: initialization, iteration, and cluster refinement. In the initialization phase, a greedy algorithm is used to select a collection of candidate medoids that are situated at a considerable distance from one another; this is done to guarantee that each cluster is represented by at least one instance in the selected set. During the iteration phase, a random set of k medoids is chosen from this reduced candidate set, and poor medoids are exchanged for new, randomly selected medoids whenever doing so improves the clustering. The average distance separating each instance from the medoid nearest to it is used to evaluate the quality of a cluster.
For each medoid, a set of dimensions is chosen along which the average distances to the medoid are significantly smaller than statistical expectation would suggest. In total, k*l dimensions must be associated with medoids, where l is an input parameter that establishes the average dimensionality of the cluster subspaces. After the subspaces for each medoid have been selected, the average Manhattan segmental distance is used to assign points to the medoids, which results in the formation of clusters. The medoid of the cluster with the fewest points is discarded, along with any medoid associated with fewer than (N/k)*minDeviation points, where minDeviation is an input parameter. During the refinement phase, the clusters that have already been generated are used as a starting point to compute new dimensions for each medoid; outliers are also eliminated during this phase, and the remaining points are reassigned to the medoids. PROCLUS, like many other top-down techniques, has a strong preference for clusters of hyper-spherical shape. In addition, even though clusters can be discovered in various subspaces, those subspaces have to be of comparable sizes, because the user is required to input the average number of dimensions for the clusters. Clusters are represented as sets of instances with associated medoids and subspaces, and together they form non-overlapping partitions of the dataset, which may also include outliers.
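The assignment step relies on the Manhattan segmental distance, that is, the Manhattan distance computed only over a medoid's selected dimensions and averaged over their number. The following hedged sketch (illustrative code, not the published PROCLUS implementation; the medoids and dimension sets are hypothetical) shows how points could be assigned with this measure.

```python
# Illustrative sketch of the Manhattan segmental distance used by PROCLUS:
# the Manhattan distance over a medoid's selected dimensions, divided by the
# number of those dimensions, used to assign each point to its closest medoid.
import numpy as np

def manhattan_segmental(x, medoid, dims):
    dims = np.asarray(dims)
    return np.abs(x[dims] - medoid[dims]).sum() / len(dims)

def assign_points(X, medoids, medoid_dims):
    """Assign each point to the medoid with the smallest segmental distance."""
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        dists = [manhattan_segmental(x, m, dims)
                 for m, dims in zip(medoids, medoid_dims)]
        labels[i] = int(np.argmin(dists))
    return labels

rng = np.random.default_rng(5)
X = rng.random((20, 6))
medoids = [X[0], X[1]]                 # hypothetical medoids
medoid_dims = [[0, 2, 3], [1, 4]]      # subspace selected for each medoid
print(assign_points(X, medoids, medoid_dims))
```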
DBSCAN(35): density-based clustering algorithms look for clusters defined by the concentration of data points in a region. The core idea underpinning density-based clustering is that, for each instance of a cluster, the neighbourhood within a given radius (Eps) must contain at least a specified minimum number of instances (MinPts). DBSCAN(35) is one of the most common density-based clustering approaches currently in use.
DBSCAN assigns the data points to one of three categories.
Core points: points located in the interior of a cluster; a point is a core point when there are sufficiently many points in its neighbourhood.
Border points: a border point is a point that is not a core point, meaning that it does not have enough points in its own neighbourhood, but it still lies in close proximity to a core point.
Noise points: a noise point is any point that qualifies as neither a core point nor a border point.
To identify a cluster, DBSCAN begins with an arbitrary instance in the data collection (D) and then retrieves all instances of D that are density-reachable from it with respect to Eps and MinPts; this process is repeated until all instances have been visited. In order to determine which points lie within Eps distance of the core points of the clusters, the method employs a spatial data structure known as the R*-tree(36), which stores information in a tree-like format. A modified variant of DBSCAN, incremental DBSCAN, has been demonstrated to achieve the same results as the original DBSCAN algorithm.(37) In addition, another clustering algorithm, GDBSCAN, generalizes the density-based algorithm DBSCAN.
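A brief usage sketch of DBSCAN with scikit-learn is given below; the data, Eps, and MinPts values are illustrative assumptions. It shows how clusters, core points, and noise points emerge from the two parameters described above.

```python
# Minimal DBSCAN usage sketch (data and parameter values are illustrative only).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(6)
blob1 = rng.normal(loc=[0, 0], scale=0.1, size=(100, 2))
blob2 = rng.normal(loc=[1, 1], scale=0.1, size=(100, 2))
noise = rng.uniform(-1, 2, size=(20, 2))
X = np.vstack([blob1, blob2, noise])

db = DBSCAN(eps=0.15, min_samples=5).fit(X)    # Eps and MinPts
labels = db.labels_                            # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_core = len(db.core_sample_indices_)

print("clusters:", n_clusters, "core points:", n_core,
      "noise points:", int((labels == -1).sum()))
```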
SUBCLU (density-connected SUBspace CLUstering)(21): this was the first approach to use density-based subspace clustering, and it extends the concept behind DBSCAN so that it can work with high-dimensional data. The method discovers density-connected clusters in all subspaces of high-dimensional data and uses a monotonicity property to prune higher-dimensional projections, which greatly reduces the search space. It does not suffer from the problems inherent in grid-based approaches, such as dependence on the positioning of the grid or a fixed shape for the clusters, because it is independent of both of these factors. The technique starts by applying DBSCAN(35) to every one-dimensional subspace in order to generate all one-dimensional clusters; the input parameters used to generate these clusters are the density threshold and the distance (radius). It then verifies, for each k-dimensional cluster, that it is present in the (k-1)-dimensional clusters; when it is not, the cluster in question is eliminated. Finally, clusters are produced by applying DBSCAN to each candidate subspace with (k+1) dimensions. These steps are carried out recursively for as long as the set of k-dimensional subspaces containing clusters is not empty.
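The bottom-up search with monotonicity pruning can be sketched as follows. This is a simplified illustration that uses scikit-learn's DBSCAN in place of the original implementation, with illustrative parameter values: a (k+1)-dimensional subspace is examined only if all of its k-dimensional subsets contained clusters.

```python
# Simplified sketch of SUBCLU-style bottom-up search (illustrative only):
# run DBSCAN in every 1-d subspace, then examine a (k+1)-d subspace only if
# all of its k-d subsets contained clusters (monotonicity / downward closure).
from itertools import combinations
import numpy as np
from sklearn.cluster import DBSCAN

def has_cluster(X, dims, eps=0.1, min_samples=10):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X[:, list(dims)]).labels_
    return bool((labels != -1).any())

def subclu_subspaces(X, eps=0.1, min_samples=10):
    d = X.shape[1]
    current = [(i,) for i in range(d) if has_cluster(X, (i,), eps, min_samples)]
    found = list(current)
    while current:
        candidates = set()
        for a, b in combinations(current, 2):
            cand = tuple(sorted(set(a) | set(b)))
            # keep only candidates whose every k-d subset also held a cluster
            if len(cand) == len(a) + 1 and all(
                    s in current for s in combinations(cand, len(a))):
                candidates.add(cand)
        current = [c for c in candidates if has_cluster(X, c, eps, min_samples)]
        found.extend(current)
    return found

rng = np.random.default_rng(7)
X = rng.random((300, 4))
X[:150, 1] = 0.3 + 0.02 * rng.random(150)     # cluster hidden in dims 1 and 3
X[:150, 3] = 0.6 + 0.02 * rng.random(150)
print(subclu_subspaces(X))
```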
INSCY (INdexing Subspace Clusters with in-process-removal of redundancY)(38) is another effective subspace clustering algorithm, founded on the notion of subspace clustering described in Assent et al.(39) It mines a region recursively for all clusters in every subspace projection, following a depth-first strategy, before continuing with the next region. It first quickly prunes all redundant low-dimensional projections and then evaluates the maximal high-dimensional projection. This strategy improves efficiency because it eliminates a number of the disadvantages associated with breadth-first subspace clustering, and it significantly cuts down runtimes. In addition, it enables the tracking of potentially useful subspace cluster regions. INSCY proposes a new index structure, the SCY-tree, which provides a compact representation of the data and permits arbitrary access to the various subspaces. SCY-trees can be merged during redundancy pruning, which makes subspace clustering highly efficient and keeps INSCY fast and concise.
Scalable density-based subspace clustering(40) reduces the time spent processing subspaces by locating and grouping together potentially useful subspaces; it mines only a small number of carefully chosen subspace clusters and their combinations directly, which helps to narrow down the search space while maintaining the accuracy of the results.(41) It operates on the premise that any high-dimensional subspace cluster can be observed in a variety of low-dimensional projections. By mining only a subset of these projections, the process gathers enough information to skip the intermediate subspaces and jump directly to the high-dimensional subspace clusters that contain the most interesting structure. With this method, the subspace clustering process can be steered without conducting any database scans for the many intermediate and redundant subspace projections. It accomplishes this by employing a priority queue, initialized with density estimates of the projections, which lays the groundwork for selecting the best candidate from the pool of prioritized candidates. In order to accommodate a very diverse range of density granularities, the priority queue is divided into three distinct sections. The method does not step through the intermediate subspaces in a best-first fashion but rather goes straight to the higher-dimensional subspaces.
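The role of a priority queue in steering the search can be illustrated loosely with the sketch below; the density score and the single-queue, best-first policy are simplified assumptions and do not reproduce the scheme of Müller et al.(40) Candidate subspaces are ranked by an estimated density, and the most promising ones are expanded first instead of enumerating every subspace.

```python
# Loose illustration of steering subspace exploration with a priority queue:
# candidate subspaces are ranked by an (assumed) density score and the most
# promising ones are expanded first, instead of enumerating all subspaces.
import heapq
import numpy as np

def density_score(X, dims, radius=0.15, sample=100):
    """Crude density estimate: average neighbour count around sampled points."""
    S = X[:, list(dims)]
    idx = np.random.default_rng(0).choice(len(S), size=min(sample, len(S)), replace=False)
    counts = [(np.abs(S - S[i]).max(axis=1) <= radius).sum() for i in idx]
    return float(np.mean(counts))

def best_first_subspaces(X, budget=8):
    d = X.shape[1]
    heap = [(-density_score(X, (i,)), (i,)) for i in range(d)]   # max-heap via negation
    heapq.heapify(heap)
    visited, picked = set(), []
    while heap and len(picked) < budget:
        neg_score, dims = heapq.heappop(heap)
        picked.append((dims, -neg_score))
        for j in range(d):                    # expand to higher-dimensional candidates
            cand = tuple(sorted(set(dims) | {j}))
            if len(cand) > len(dims) and cand not in visited:
                visited.add(cand)
                heapq.heappush(heap, (-density_score(X, cand), cand))
    return picked

rng = np.random.default_rng(8)
X = rng.random((400, 5))
X[:200, 0] = 0.2 + 0.02 * rng.random(200)     # dense structure in dims 0 and 4
X[:200, 4] = 0.8 + 0.02 * rng.random(200)
for dims, score in best_first_subspaces(X):
    print(dims, round(score, 1))
```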
CONCLUSION
From social media promotion to genome mapping, big data research has tremendous breadth and a great deal of room for improvement, because big data has become the primary focus of many businesses and industries. Subspace clustering is one of the essential techniques that can assist in the examination of big data. The purpose of this document is to provide a general introduction to Big Data and its challenges. A literature review of the various types of subspace clustering approaches and their categorization is presented, in order to facilitate a straightforward comprehension of the different classifications of subspace algorithms. A concise presentation of essential approaches to subspace clustering is also given. This paper can help researchers comprehend big data, its challenges, and existing approaches, and it can also contribute to the design of new clustering methods for big data.
REFERENCES
1. David JM, Balakrishnan K. Prediction of Key Symptoms of Learning Disabilities in School-Age Children using Rough Sets. Int J Comput Electr Eng. 2011;3(1):163-169.
2. Gupta R. Journey from data mining to Web Mining to Big Data. IJCTT. 2014;10(1):18-20.
3. Sharma PP, Navdeti CP. Securing Big Data Hadoop: A Review of Security Issues, Threats and Solution. IJCSIT. 2014;5(2):2126-2131.
4. Gupta R, Gupta S, Singhal A. Big Data: Overview. IJCTT. 2014;9(5).
5. Jain A. Data clustering: 50 years beyond k-means. Pattern Recognition Letters. 2010;31(8):651-666.
6. Jain A, Dubes R. Algorithms for Clustering Data. Prentice Hall; 1988.
7. Karger DR. Random sampling in cut, flow, and network design problems. STOC. 1994;648-657.
8. Laney D. 3-D data management: Controlling data volume, velocity and variety. Application Delivery Strategies by META Group Inc. [Internet]. 2001 [cited 2023 Jun 10]. Available from: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
9. Chen H, Chiang RHL, Storey VC. Business intelligence and analytics: From big data to big impact. MIS Quarterly. 2012;36(4):1165-1188.
10. Kwon O, Lee N, Shin B. Data quality management, data usage experience and acquisition intention of big data analytics. Int J Inf Manage. 2014;34(3):387-394.
11. TechAmerica Foundation’s Federal Big Data Commission. Demystifying big data: A practical guide to transforming the business of Government. [Internet]. 2012 [cited 2023 Jun 10]. Available from: http://www.techamerica.org/Docs/fileManager.cfm?f=techamerica-bigdatareport-final.pdf
12. Gartner IT Glossary. [Internet]. n.d. [cited 2023 Jun 10]. Available from: http://www.gartner.com/it-glossary/big-data/
13. Cukier K. The Economist, Data, data everywhere: A special report on managing information. February 25, 2010. [Internet]. [cited 2023 Jun 10]. Available from: http://www.economist.com/node/15557443
14. Chen L. Curse of Dimensionality. In: Liu L, Özsu MT, editors. Encyclopedia of Database Systems. Springer; 2009. p. 133.
15. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD. 1998;27(2):94-105.
16. Berchtold S, Bohm C, Keim D, Kriegel H-P. A Cost Model for Nearest Neighbour Search in High Dimensional Data Space. PODS. 1997;78-86.
17. Lance P, Haque E, Liu H. Subspace Clustering for High Dimensional Data: A Review. ACM SIGKDD Explorations Newsletter. 2004;6(1):90-105.
18. Ilango MR, Mohan V. A survey of Grid Based Clustering Algorithms. Int J Eng Sci Technol. 2010;2(8):3441-3446.
19. Karlton S, Zaki M. SCHISM: A New Approach to Interesting Subspace Mining. Int J Bus Intell Data Min. 2005;1(2):137-160.
20. Goil S, Nagesh H, Choudhary A. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Technical Report CPDC-TR-9906-010. Northwestern University; 1999.
21. Kailing K, Kriegel H-P, Kroger P. Density-Connected Subspace Clustering for High Dimensional Data. SIAM International Conference on Data Mining. 2004;246-257.
22. Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS. Fast Algorithms for Projected Clustering. ACM SIGMOD International Conference on Management of Data. 1999;61-72.
23. Ng RT, Han J. CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Trans Knowl Data Eng. 2002;14(5):1003-1016.
24. Aggarwal C, Yu P. Finding Generalized Projected Clusters in High Dimensional Spaces. ACM SIGMOD International Conference on Management of Data. 2000;70–81.
25. Procopiuc C, Jones M, Agarwal PK, Murali TM. A Monte Carlo Algorithm for Fast Projective Clustering. ACM SIGMOD International Conference on Management of Data. 2002;418-427.
26. Kriegel HP, Kroger P, Zimek A. Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, & Correlation Clustering. ACM TKDD. 2009;3(1):1.
27. Wang H, Wang W, Yang J, Yu P. Clustering by Pattern Similarity in Large Data Sets. ACM SIGMOD International Conference on Management of Data. 2002;394-405.
28. Bohm C, Kailing K, Kriegel H-P, Kroger P. Density Connected Clustering with Local Subspace Preferences. IEEE International Conference on Data Mining. 2004;27-34.
29. Friedman J, Meulman J. Clustering objects on subsets of attributes. J R Stat Soc Ser B. 2004;66:815-849.
30. Kriegel HP, Kroger P, Renz M, Wurst S. A Generic Framework for Efficient Subspace Clustering of High Dimensional Data. IEEE International Conference on Data Mining. 2005;250-257.
31. Blum A, Langley P. Selection of Relevant Features and Examples in Machine Learning. Artif Intell. 1997;97:245–271.
32. Müller E, Günnemann S, Assent I, Seidl T. Evaluating Clustering in Subspace Projections of High Dimensional Data. VLDB Endowment. 2009;2(1):1270-1281.
33. Cheng CH, Fu AW, Zhang Y. Entropy-based subspace clustering for mining numerical data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1999;84-93.
34. Ng R, Han J. Efficient and effective clustering methods for spatial data mining. VLDB Conference. 1994;144-155.
35. Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial data sets with noise. Proc Int Conf Knowl Discov Data Min. 1996;226–231.
36. Katayama N, Satoh S. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. ACM SIGMOD International Conference on Management of Data. 1997.
37. Ester M, Kriegel H-P, Sander J, Wimmer M, Xu X. Incremental Clustering for Mining in a Data Warehousing Environment. VLDB Conference. 1998.
38. Assent I, Krieger R, Müller E, Seidl T. INSCY: Indexing Subspace Clusters with In Process-Removal of Redundancy. IEEE International Conference on Data Mining. 2008;414–425.
39. Assent I, Krieger R, Muller E, Seidl T. DUSC: Dimensionality Unbiased Subspace Clustering. IEEE Intl. Conf. on Data Mining (ICDM). 2007;409-414.
40. Müller E, Assent I, Günnemann S, Seidl T. Scalable Density based Subspace Clustering. ACM Conference on Information and Knowledge Management (CIKM'11). 2011;1076-1086.
41. Sangapu SC, Prasad KSN, Kannan RJ, et al. Impact of class imbalance in VeReMi dataset for misbehavior detection in autonomous vehicles. Soft Comput. 2023. https://doi.org/10.1007/s00500-023-08003-4.
FINANCING
No financing.
CONFLICT OF INTEREST
None.
AUTHORSHIP CONTRIBUTION
Conceptualization: Vijaya Kishore Veparala, Vattikunta Kalpana.
Research: Vijaya Kishore Veparala, Vattikunta Kalpana.
Writing - proofreading and editing: Vijaya Kishore Veparala, Vattikunta Kalpana.