
Introduction

Chapter 1. Theoretical Foundations of Big Data Analysis

1.1 About Big Data

1.2 Map-Reduce

1.3 Data Mining for Big Data

1.4 Tasks solved by Data Mining methods

Conclusion to the first chapter

Chapter 2. Cluster Analysis for Big Data

2.1 Choosing a clustering method

2.2 Hierarchical methods

2.3 Non-hierarchical methods

2.4 Comparison of types of clustering

2.5 Statistics related to cluster analysis

Conclusion to the second chapter

Chapter 3

3.1 Client profile

3.2 Correspondence analysis

3.3 Main idea of cluster analysis

3.4 Features for clustering

3.5 Identification of points homogeneous in location

3.5.1 Final stratification

3.6 Clustering objects into homogeneous groups

3.7 Clustering the assortment of outlets

Conclusion to the third chapter

Conclusion

Bibliography

Introduction

Humanity, in its development, draws on material, energy, instrumental and information resources. Information about past, present and possible future events is of great value for analyzing what is happening. As the ancients said: Praemonitus, praemunitus - "forewarned is forearmed".

The modern development of society is characterized by an unprecedented growth of information flows in industry, trade, and financial markets. A society's ability to store and quickly process information largely determines a country's level of development.

Modern society pays great attention to the problem of collecting, storing and processing information. At the moment, however, there is a clear contradiction. On the one hand, human civilization is experiencing an information explosion: the amount of information multiplies every year. On the other hand, the growth of the volume of information in society exceeds the individual's ability to assimilate it. Such problems drive the mass development of technologies and technical means for handling information flows.

The vital role of information in the modern world has led to its recognition as a resource in its own right, as important and necessary as energy, financial and raw-material resources.

Society's needs for the collection, storage and processing of information as a commodity have created a new range of services: the information technology market.

For the fullest use of information technologies, information needs to be collected and processed; places for its storage and accumulation must be created, along with transmission systems and access-control systems; and, finally, information needs to be systematized. The last problem has become the most pressing in recent times: a huge amount of information entering global storage arrays without systematization can lead to an information collapse, when accessing or searching for the right information becomes like looking for a needle in a haystack.

The purpose of this work: a comparative analysis of cluster analysis methods in solving grouping problems.

Task: to analyze approaches to the use of cluster analysis in problems of classifying large data sets.

In the course of the work, various methods of cluster analysis will be applied in order to identify the advantages and disadvantages of each of them and to choose the most suitable one for the tasks at hand. The central issue of cluster analysis, the question of the number of clusters, will also be raised, and recommendations will be given for its solution. The relevance of this work stems from the urgent need to determine optimal methods for processing large amounts of data and to solve data-systematization problems as quickly as possible. The wide practical applicability of results obtained through cluster analysis further underscores the relevance of this study. Certain aspects of these problems in the modern development of information technologies are the subject of my thesis.

Chapter 1. Theoretical Foundations of Big Data Analysis

1.1 About Big Data

The term "Big Data" describes datasets, potentially growing exponentially, that are too large, too loosely formatted, or too unstructured to be analyzed by traditional methods.

Big Data technologies are a series of approaches, tools and methods for processing structured and unstructured data of huge volume and significant variety. These technologies are used to obtain human-perceptible results and remain effective under continuous data growth and distribution of information over numerous nodes of a computer network. They were formed in the late 2000s as an alternative to traditional database management systems and business intelligence solutions. Currently, most of the largest information technology vendors for organizations use the concept of "big data" in their business strategies, and the leading analysts of the information technology market devote dedicated studies to the concept.

Currently, a significant number of companies are closely following the development of the technology. According to the McKinsey Global Institute report "Big data: The next frontier for innovation, competition, and productivity", data has become an important factor of production along with labor and capital. The use of Big Data is becoming the basis for competitive advantage and company growth.

In modern conditions, organizations and companies create a huge amount of unstructured data: text, various documents, images, videos, machine codes, tables, and the like. All of this information is hosted and stored in multiple repositories, often outside the organization.

Organizations may have access to huge amounts of their own data but lack the tools needed to establish relationships between these data and draw meaningful conclusions from them. Given the rapid and continuous growth of data, it becomes urgently necessary to move from traditional analysis methods to more advanced technologies of the Big Data class.

Characteristics. In modern sources, Big Data is defined as data whose volume is on the order of terabytes and beyond. The signs of Big Data are commonly summarized as the "three Vs": volume; variety (heterogeneity of formats); velocity (the need for very fast processing).

Figure 1 Signs of big data

· Volume. The rapid development of technology and the popularization of social networks contribute to the very rapid growth of data volumes. This data, generated by both humans and machines, is distributed in various places and formats in huge volumes.

· Velocity. This is the speed at which data is generated. Getting the needed data in the shortest possible time is an important competitive advantage for solution developers, not least because different applications have different latency requirements.

· Variety. Variety refers to the different data storage formats. Today, significant amounts of unstructured data are generated in the world, in addition to the structured data that enterprises receive. Before the era of Big Data technology, the industry had no powerful and reliable tools capable of working with the voluminous unstructured data we see today.

Consuming vast amounts of data generated both inside and outside the enterprise is a necessity for organizations in today's world in order to remain competitive.

The category of Big Data traditionally includes not only ordinary spreadsheets, but also unstructured data stored as images, audio files, video files, web logs, sensor data, and much else. This aspect of diverse data formats is what is called variety in the world of big data.

Figure 2 below gives a comparative description of a traditional database and a Big Data database.

There are a number of industries in which data is collected and accumulated very intensively. For applications of this class, which need to store data for years, the accumulated data is classified as Extremely Big Data.

The number of Big Data applications in the commercial and government sectors is also growing; the stored data volumes of such applications often amount to hundreds of petabytes.

Figure 2 Comparative characteristics of the data

The development of certain technologies makes it possible to "track" people, their habits, interests and consumer behavior in various ways. Examples include using the internet in general and shopping at online retailers such as Walmart in particular (according to Wikipedia, Walmart's data storage is estimated at over 2 petabytes), or traveling and moving around with mobile phones. People make calls, write letters, take photos, and log into social network accounts from different parts of the world; all of this accumulates in databases and can be put to good use thanks to fast processing of big data.

Likewise, modern medical technologies generate large amounts of data related to the provision of medical care (images, videos, real-time monitoring).

Sources of big data. Just as data storage formats have changed, data sources have also evolved and are constantly expanding. Data needs to be stored in a wide variety of formats.

With the development and advancement of technology, the amount of data that is generated is constantly growing. Big data sources can be divided into six different categories as shown below.

Figure 3 Sources of big data

· Enterprise data. Enterprises hold large amounts of data in various formats. Common formats include flat files, emails, Word documents, spreadsheets, presentations, HTML pages, PDF documents, XML files, legacy formats, etc. Such data, distributed throughout the organization in various formats, is called enterprise data.

· Transactional data. Every enterprise has its own applications that execute transactions of various kinds: web applications, mobile applications, CRM systems and many others.

To support transactions in these applications, one or more relational databases are typically used as the underlying infrastructure. This is mostly structured data and is called transactional data.

· Social media. Social networks such as Twitter, Facebook and many others generate large amounts of data. Typically, social networks use unstructured data formats, including text, images, audio and video. This category of data sources is called social media.

· Activity-generated data. This includes data from medical devices, sensor data, surveillance video, satellites, cell phone towers, industrial equipment, and other data generated primarily by machines. These types of data are called activity-generated data.

· Public data. This includes data that is publicly available, such as data published by governments, research data published by research institutes, data from meteorological departments, census data, Wikipedia, open data samples, and other data freely available to the public. This type of data is called public data.

· Archive. Organizations archive a lot of data that is either no longer needed or rarely needed. In today's world, where hardware is getting cheaper, no organization wants to delete anything; they want to keep as much data as possible. This type of rarely accessed data is called archival data.

Implementation examples. The most frequently cited example of this technology is the Hadoop project, which implements distributed computing for processing impressive amounts of data.

The project is developed by the Apache Software Foundation; Cloudera supports it commercially.

Developers from various countries of the world participate in the project.

Technologically, Apache Hadoop can be described as a free Java framework that supports the execution of distributed applications running on large clusters built from commodity hardware.

Since data processing is performed on a cluster of servers, if one of them fails, its work is redistributed among the remaining working servers.

It is also worth mentioning Hadoop's implementation of the MapReduce technology, whose main task is the automatic parallelization of data and its processing on clusters.

The core of Hadoop is the fault-tolerant distributed file system HDFS (Hadoop Distributed File System), which manages data storage.

The essence of the system is to split incoming data into blocks, each of which is assigned a designated position in the server pool. The system allows applications to scale to thousands of nodes and petabytes of data.

1.2 Map-Reduce

In this paragraph, we will focus on the Map-Reduce algorithm, which is a model for distributed computing.

Its operation is based on distributing the input data to the worker nodes of a distributed file system for pre-processing (the map step), followed by the folding (combining) of the pre-processed data (the reduce step).

The algorithm calculates subtotals on each node of the distributed file system, then sums the subtotals to arrive at the final result.
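The map and reduce steps described above can be sketched in miniature with a word count, the canonical MapReduce example. This is an illustrative single-process sketch, not framework code; the names `map_phase` and `reduce_phase` are ours:

```python
from itertools import groupby

def map_phase(chunks):
    # Map step: each "node" emits (key, value) pairs from its chunk of input.
    return [(word, 1) for chunk in chunks for word in chunk.split()]

def reduce_phase(pairs):
    # Shuffle: sort and group the pairs by key.
    # Reduce step: fold each group into a subtotal per key.
    pairs.sort(key=lambda kv: kv[0])
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

chunks = ["big data big", "data mining"]   # input split across worker nodes
counts = reduce_phase(map_phase(chunks))   # {'big': 2, 'data': 2, 'mining': 1}
```

In a real cluster the map calls run in parallel on the nodes holding the data blocks, and the framework performs the shuffle and reduce across the network.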

Magic Quadrant for Data Science Platforms (Gartner, February 2017)

Figure 4 Leaders

Companies:

Leaders: IBM, SAS, RapidMiner, KNIME

Challengers: MathWorks, Quest (formerly Dell), Alteryx, Angoss

Visionaries: Microsoft, H2O.ai, Dataiku, Domino Data Lab, Alpine Data

Niche players: FICO, SAP, Teradata

1.3 Data Mining for Big Data

Data Mining (DM) is "a technology designed to search for non-obvious, objective and practically useful patterns in large amounts of data."

A distinctive feature of Data Mining is its combination of a wide mathematical toolkit (from classical statistical analysis to new cybernetic methods) with the latest advances in information technology.

This technology combines strictly formalized methods with methods of informal analysis, i.e., quantitative and qualitative data analysis.

1.4 Tasks solved by Data Mining methods

· Correlation - establishing a statistical dependence of a continuous output variable on input variables.

· Clustering - grouping objects (observations, events) based on the data (properties) that describe their essence. Objects within a cluster must be "similar" to one another and differ from objects in other clusters.

Clustering quality is higher when objects within a cluster are as similar as possible and the clusters themselves are as different as possible.

· Classification - assigning objects (observations, events) to one of a set of previously known classes.

· Association - identifying patterns between related events. An example of such a pattern is a rule that indicates that event Y follows from event X. Such rules are called associative.
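Association rules are usually scored by support (how often the combined events occur) and confidence (how often Y occurs given X). A minimal sketch with invented transaction data (the data and helper names are ours, for illustration only):

```python
# Toy transaction data, invented for this sketch.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Confidence of the rule X -> Y: support of X and Y together,
    # divided by the support of X alone.
    return support(x | y) / support(x)

sup = support({"bread", "milk"})          # 2 of 4 transactions -> 0.5
conf = confidence({"bread"}, {"milk"})    # 2 of the 3 "bread" transactions
```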

Conclusion to the first chapter

Big Data is not just another hype on the IT market; it is a systematic, qualitative transition to building knowledge-based value chains.

In terms of effect, it can be compared with the appearance of affordable computer technology at the end of the last century.

While short-sighted conservatives continue to apply deeply outdated approaches, enterprises that already use Big Data technologies will take the lead and gain competitive advantages in the market. There is no doubt that all major organizations will implement this technology in the coming years, as it is both the present and the future.

Chapter 2. Cluster Analysis for Big Data

Cluster analysis is a class of methods used to classify objects or events into relatively homogeneous groups, called clusters.

It is fundamental that objects within a cluster must be similar to each other while differing from objects in other clusters.

Figure 5 illustrates an ideal clustering situation: each cluster is clearly separated based on differences in two variables, quality orientation (X) and price sensitivity (Y).

Figure 5 Ideal Clustering Situation

It should be noted that absolutely every consumer falls into one of the clusters, and there are no overlapping areas.

However, the illustration below shows the most common clustering situation in practice.

As Figure 6 shows, the cluster boundaries are extremely vague, and it is not entirely clear which consumers belong to which cluster, since a significant portion of them cannot be assigned to any single cluster.

Figure 6 Real situation of clustering

In cluster analysis, groups or clusters are identified from the collected actual data rather than defined in advance. Thus, there is no need for preliminary information about the cluster membership of any of the objects.

Market segmentation. For example, consumers can be divided into clusters based on the benefits they expect from the purchase of a given product. Each cluster then contains consumers seeking similar benefits. This approach is commonly referred to as benefit segmentation.

Understanding buyer behavior. Cluster analysis is used when it is necessary to identify homogeneous categories of buyers.

Determining the opportunities for a new product. Competitive groups and sets within a given market are also determined by clustering trademarks and goods.

Selecting test markets. Cities for testing multiple marketing strategies are selected by grouping cities into homogeneous clusters.

Reducing data dimensionality. Cluster analysis is also used as a primary data reduction tool to create clusters or subgroups of data that are more amenable to analysis than individual observations. Subsequent multivariate analysis is then performed on the clusters rather than on individual observations.

2.1 Choosing a clustering method

There are two types of clustering methods: hierarchical and non-hierarchical.

Figure 7 Cluster analysis methods

2.2 Hierarchical methods

Hierarchical methods are divided into two types: agglomerative and divisive.

Agglomerative clustering starts with each object in a separate cluster. Objects are grouped into ever larger clusters. This process will continue until all objects become members of a single cluster.

Divisive clustering works in the opposite direction: it starts with all objects grouped in a single cluster, and clusters are divided until each object is in its own cluster. Agglomerative methods, such as linkage methods as well as variance and centroid methods, are most often used in research.

Linkage methods include the single linkage, complete linkage and average linkage methods. These are agglomerative hierarchical clustering methods that combine objects into a cluster based on the computed distance between them.

Figure 8 Single linkage method

The single linkage method is based on the minimum distance, or nearest-neighbor rule: the distance between clusters A and B is the smallest of the pairwise distances, d(A, B) = min{d(a, b): a ∈ A, b ∈ B} (Formula 1).

When forming a cluster, the two objects with the minimal distance between them are combined first. Then the next shortest distance is determined, and a third object joins the cluster containing the first two.

At each stage, the distance between two clusters is the distance between their nearest points; at any stage, two clusters are merged across the single shortest distance between them.

This process continues until all objects are clustered. If the clusters are poorly defined, the single linkage method performs poorly.

Figure 9 Complete linkage method

The complete linkage method is based on the maximum distance between objects, or the farthest-neighbor rule. Here the distance between two clusters is the distance between their two most distant points.

Figure 10 Average linkage method

In the average linkage method, the distance between two clusters is defined as the average of all distances measured between pairs of objects drawn from the two clusters, one object from each cluster. The average linkage method thus uses information about all pairwise distances, not just the minimum or maximum. For this reason it is generally preferred over the single and complete linkage methods.
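The three linkage rules differ only in how they aggregate pairwise distances between two clusters. A minimal sketch for two-dimensional objects with the Euclidean distance (function names are ours, chosen for illustration):

```python
import math

def dist(a, b):
    # Euclidean distance between two 2-D points.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def single_link(c1, c2):
    # Nearest-neighbor rule: minimum pairwise distance.
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # Farthest-neighbor rule: maximum pairwise distance.
    return max(dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    # Mean of all pairwise distances, one object from each cluster.
    d = [dist(a, b) for a in c1 for b in c2]
    return sum(d) / len(d)

c1 = [(0, 0), (0, 1)]
c2 = [(3, 0), (4, 0)]
# single_link(c1, c2) == 3.0; complete_link gives sqrt(17) ≈ 4.12,
# and average_link ≈ 3.57 lies between the two extremes.
```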

In variance (dispersion) methods, clusters are formed so as to minimize the within-cluster variance.

Figure 11 Ward Method

A widely known variance method is Ward's method, in which clusters are formed so as to minimize the sum of squared Euclidean distances to the cluster means.

For each cluster, the averages of all variables are calculated. Then, for each object, the squared Euclidean distances to the cluster means are calculated.

These squared distances are summed over all objects. At each stage, the two clusters whose merger yields the smallest increase in the total within-cluster variance are combined.

Figure 12 Centroid Method

In the centroid method, the distance between two clusters is the distance between their centroids (the means over all variables).

The centroid method is a variance method for hierarchical clustering. Each time objects are merged, a new centroid is calculated.

Ward's method and the average linkage method show the best results among the hierarchical methods.

2.3 Non-hierarchical methods

Another type of clustering procedure is the non-hierarchical clustering methods, often referred to as k-means methods. The k-means method (k-means clustering) determines cluster centers and then groups all objects lying within a threshold value of a center. These methods include sequential thresholding, parallel thresholding, and optimizing allocation.

The k-means method minimizes the total within-cluster scatter V = Σ_{i=1..k} Σ_{x ∈ S_i} ‖x − μ_i‖², where k is the number of clusters, S_i are the resulting clusters, i = 1, 2, …, k, and μ_i are the centers of mass (centroids) of the vectors x ∈ S_i.

Figure 13 An example of the operation of the k-means algorithm (k=2)

The sequential threshold method groups together objects that lie within a threshold value of a given center.

The next step is to define a new cluster center, and the process is repeated for the ungrouped points. Once an object is placed in a cluster with a new center, it is no longer considered for further clustering.

The parallel threshold method works similarly, with one important difference: several cluster centers are selected simultaneously, and objects within the threshold level are grouped with the nearest center.

The optimizing allocation method differs from the two threshold methods in that objects can subsequently be reassigned to other clusters (redistributed) in order to optimize an overall criterion, such as the average within-cluster distance for a given number of clusters.
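The reassignment loop common to these methods can be sketched as a bare-bones version of Lloyd's k-means algorithm (a simplification: one seeded initialization, two-dimensional points; in practice a library implementation would be used):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Lloyd's algorithm: repeatedly assign each point to its nearest
    # center, then move each center to the mean of its assigned points.
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # initial centers: k data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)             # reassignment step
        centers = [(sum(p[0] for p in c) / len(c),
                    sum(p[1] for p in c) / len(c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)   # recovers the two separated groups
```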

The BIRCH algorithm increases clustering speed by using generalized (summary) representations of clusters, and it scales well. The algorithm implements a two-stage clustering process.

The first stage forms a preliminary set of clusters. The next stage applies other clustering algorithms, suitable for working in RAM, to the identified clusters.

Imagine each data element as a bead lying on the surface of a table; it is then quite possible to "replace" the preliminary clusters with tennis balls and proceed to study the tennis-ball clusters in more detail.

The number of beads can be quite large, but the diameter of the tennis balls can be chosen so that, at the second stage, traditional clustering algorithms can determine the actual complex shape of the clusters.

Among the newer scalable algorithms one can also note CURE, a hierarchical clustering algorithm in which the concept of a cluster is formulated through the concept of density. Many researchers are actively working on scalable methods whose main task is to overcome the shortcomings of today's algorithms.

2.4 Comparison of types of clustering

The table lists the advantages and disadvantages of the following methods: the CURE algorithm, BIRCH, MST, k-means, PAM, CLOPE, Kohonen self-organizing maps, HCM (Hard C-Means), and Fuzzy C-Means.

2.5 Statistics related to cluster analysis

The following statistics and concepts are related to cluster analysis:

1. Cluster centroid. Average value of variables for all cases or objects in a particular cluster.

2. Cluster centers. Initial starting points in non-hierarchical clustering. Clusters are built around these centers, or grains of clustering.

3. Belonging to a cluster. Specifies the cluster that each case or object belongs to.

4. Tree diagram (dendrogram) - a graphical tool for displaying clustering results. Vertical lines represent clusters being merged; the position of a vertical line on the distance scale shows the distance at which the clusters were merged. The diagram is read from left to right.

5. Variation index. A check of clustering quality: the ratio of the standard deviation to the mean.

6. Icicle diagram. A graphical display of clustering results.

7. Similarity matrix / distance matrix. A lower triangular matrix containing the distances between pairs of objects or cases.
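A lower triangular distance matrix of the kind described above can be built directly: row i holds the distances from object i to every earlier object (toy data, for illustration only):

```python
import math

objects = [(0, 0), (3, 0), (0, 4)]   # three objects with two features each

# Row i of the lower triangular matrix holds the Euclidean distances from
# object i to every object j < i; the diagonal and the upper part are
# omitted because the matrix is symmetric with a zero diagonal.
matrix = [[math.dist(objects[i], objects[j]) for j in range(i)]
          for i in range(len(objects))]
# matrix == [[], [3.0], [4.0, 5.0]]
```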

Conclusion to the second chapter

Cluster analysis can justly be called one of the most convenient tools for identifying market segments. The use of these methods has become especially relevant in the age of high technology, in which it is so important to accelerate labor-intensive and lengthy processes. The variables used as the basis for clustering should be chosen based on the experience of previous studies, theoretical background, tested hypotheses, and the judgment of the researcher. In addition, an appropriate measure of similarity should be chosen. A distinctive feature of hierarchical clustering is the development of a hierarchical structure. Two types of hierarchical clustering methods exist and are used: agglomerative and divisive.

Agglomerative methods include the single, complete and average linkage methods. The most common variance method is Ward's method. Non-hierarchical clustering methods are often referred to as k-means methods. The choice of clustering method and the choice of distance measure are interrelated. In hierarchical clustering, an important criterion for deciding on the number of clusters is the distance at which clusters are merged. The relative sizes of clusters should be such that it makes sense to keep a cluster rather than merge it with others. Clusters are interpreted in terms of their centroids. It is often helpful to interpret clusters by profiling them through variables that did not underlie the clustering. The reliability and validity of clustering solutions can be evaluated in several ways.

Chapter 3

A trading enterprise with 36,651 outlets selling confectionery products was taken as the object of study. The enterprise's product list includes more than 350 items.

The aim of this study is a comparative analysis of cluster analysis methods applied to the following problems:

1. Study of the client's profile and analysis of the correspondence between the given characteristics;

2. Division into clusters - allocation of homogeneous groups;

3. Division of the trading enterprise's assortment into homogeneous groups.

3.1 Client profile

According to a Galileo study conducted in the second half of 2016, about 42 million people who consume confectionery were interviewed.

It follows from this survey that the main consumers of confectionery products are women.

This can be attributed to the fact that women traditionally receive chocolate products as gifts, and most lovers of confectionery are women. This is clearly seen in Figure 14.

· up to 16 years old - the main consumers of chocolate in the form of figures;

· from 16 to 24 years old - the main consumers of chocolate bars;

· chocolate in bar form is in most cases purchased by women from 25 to 34 years old;

· people from 25 to 45 years old - the main buyers of sweets in boxes;

· From 45 and older prefer loose sweets.

Figure 14 Confectionery consumption by gender

Figure 15 shows the distribution of total consumption into 3 groups depending on wealth: A - low, B - medium, C - high income. The lion's share of consumers falls on the middle-income group (54%), followed by the low-income group (29%); the smallest contribution comes from the high-income group (17%).

Figure 15 Confectionery consumption by income

This graph illustrates the audience's preferences in choosing where to buy; let us also consider the distribution depending on income. Clearly, the largest number of purchases are made in hyper- and supermarkets, which holds for each of the income groups.

The share of purchases in supermarkets is almost half (46%) for group C, from which one can conclude that it makes sense to expand the range of goods popular among people with high incomes.

Middle-income people account for 41% of supermarket purchases, while low-income people account for the smallest share at 37%. Next comes the share of purchases in small self-service stores; purchases in such stores are made by all three groups in equal proportions. The smallest share falls on markets and stalls, where the main contribution is made by representatives of group A, which includes a large number of pensioners who often make purchases in the market “out of habit”.

Figure 16 Locations of confectionery purchases by income

The following graph clearly illustrates the importance of particular product features for each of the three income groups. For groups A and B, price is the most important factor, while the appearance of the packaging and the country of origin matter little. The behavior of the high-income group differs slightly: in addition to price, the brand, the appearance of the packaging and the country of production are important.

Figure 17 Priorities when choosing confectionery products for different income groups

3.2 Correspondence analysis

Correspondence analysis is used to visualize tables. The method reveals relationships between the features in the columns and rows of a table.

Let us further consider the analysis of the correspondence between the consumption of confectionery products by gender and age, illustrated in Figure 7, as well as Figure 8, which shows the consumption of various categories of products depending on the income of consumers.

First, let's consider the preferences of three groups of men: aged 16-19, 20-24 and 25-34, since their consumer preferences can be characterized as almost identical.

Figure 18 Correspondence analysis of popular sweets by age and gender

Men in these age groups prefer Snickers, Mars, Nuts, Twix, Picnic, Kinder Bueno and M&M's. These products fall into the "chocolate bars and other chocolate in small packages" category and are most popular among low-income individuals.

Next come the four remaining male age groups: 35-44, 45-54, 55-64 and 65-74. They too are characterized by approximately the same consumer behavior, and they are extremely passive consumers. For these groups, consumption varies inversely with income: among men aged 35-74 with high incomes, consumer activity is the lowest.

Clearly, the niche of solvent men aged 35-74 is very promising and at the same time unoccupied, but the existing product range cannot satisfy this category of consumers. From this we can conclude that it makes sense to target this audience with some completely new product capable of attracting them.

The next step is to describe the groups of women aged 16-19, 20-24 and 25-34, who show similar consumer behavior. These groups, as a rule, prefer chocolate bars; some choices overlap with those of men of the same age (Picnic, Twix, Nuts, etc.), and the Tempo, Bounty, Kit Kat, Milky Way, Kinder Country and "An Ordinary Miracle" bars are also very popular among women.

The low-income pattern also holds for these groups: as income grows, the popularity of chocolate bars declines. Next comes the group of women aged 35-44, for whom Alpen Gold is the most popular choice, followed by Geisha sweets and mini cakes, a statement that holds for low- and middle-income consumers alike. With increasing age (groups 45-54, 55-64, 65-74), the preferred choices become Alenka, Korovka, Sladko, sweets of the Krupskaya factory and other domestic brands. This is most true of persons with average incomes. Assessing confectionery consumption as a whole, it should be noted that two thirds of all consumption falls on the female part of the population.

3.3 Main idea of cluster analysis

Before applying the clustering algorithm, all outlets are divided into strata. The algorithm is applied separately to each of the obtained strata. The clusters obtained for individual groups are then combined into one final set of clusters.

Let us describe the details of the clustering algorithm. Denote the number of outlets to which the algorithm is applied by n, the set of outlets by X, the Euclidean metric by ρ, and the number of features by m. The set of features and, consequently, their number depend on the stratum.

First of all, the values of all features are standardized. Standardization is the transformation of a feature by subtracting its mean and dividing by its standard deviation. The mean and standard deviation are calculated once over the data being clustered and are part of the clustering model.
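The standardization step can be sketched as follows (the sample matrix is an illustrative placeholder; the returned mean and standard deviation are the quantities stored in the clustering model and reused later):

```python
import numpy as np

def standardize(X):
    """Standardize each feature: subtract its mean, divide by its std.

    The (mean, std) pair becomes part of the clustering model and is
    reused when the model is applied to new outlets.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma, mu, sigma

# Illustrative data: two features over three outlets
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])
Z, mu, sigma = standardize(X)
print(mu)
print(Z.mean(axis=0))
```

After the transformation, each feature column has zero mean and unit standard deviation, so no single feature dominates the Euclidean distances used by KMeans.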

We use the KMeans algorithm as the clustering algorithm. This algorithm requires specifying the number of clusters and the number of initializations of the iterative clustering process (or the initial centroids). The number of initializations depends on the time available for clustering. To determine the number of clusters, we run KMeans with the number of clusters k from 2 to 75. Denote the resulting clustering models by M_k and their centroids by c_1, ..., c_k. For each model, we determine the measure of intracluster spread

W_k = Σ_{x ∈ X} ρ(x, c_{j(x)})²,

where j(x) is the index of the cluster to which outlet x is assigned.

We can also consider a clustering model for the case k = 1. In this case there is a single centroid c, defined as the element-wise average of all outlets in X. The measure of intracluster spread obtained in this case is called the measure of the total spread of outlets:

W_1 = Σ_{x ∈ X} ρ(x, c)².

The ratio

W_k / W_1

can be interpreted as the proportion of unexplained differences between outlets within clusters. This ratio decreases as k grows. We define the optimal number of clusters as

k* = min { k : W_k / W_1 ≤ 0.2 }.

In other words, we choose the minimum number of clusters such that the proportion of unexplained differences is at most 20%.

Note. Instead of the value 0.2, any value between 0 and 1 can be taken. The choice depends on the restrictions on the number of clusters, as well as on the shape of the graph of the ratio W_k / W_1 as a function of k. However, if the maximum allowable proportion of unexplained differences is fixed before clustering begins, then to find k* it is not necessary to build cluster models for all k from 2 to 75: binary search can be used instead, which significantly speeds up clustering.
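The binary search for k* can be sketched as follows with scikit-learn's KMeans, whose `inertia_` attribute is exactly the within-cluster sum of squares W_k (the data matrix, threshold and seed below are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def optimal_k(Z, threshold=0.2, k_min=2, k_max=75, n_init=10, seed=0):
    """Smallest k with W_k / W_1 <= threshold, found by binary search.

    Assumes the ratio decreases in k, as described in the text; Z is the
    standardized feature matrix for one stratum.
    """
    w1 = ((Z - Z.mean(axis=0)) ** 2).sum()   # total spread, case k = 1
    lo, hi = k_min, min(k_max, len(Z))
    best = hi
    while lo <= hi:
        k = (lo + hi) // 2
        wk = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(Z).inertia_
        if wk / w1 <= threshold:
            best, hi = k, k - 1    # feasible: try fewer clusters
        else:
            lo = k + 1             # too much unexplained spread: need more
    return best
```

Binary search is valid only because the feasibility condition W_k / W_1 ≤ 0.2 is monotone in k; since KMeans can get stuck in local optima, several initializations (`n_init`) are used to keep the ratio effectively monotone.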

As a result of clustering, we get the following components of the complete clustering model:

· μ(s, t), the average values of the features for stratum s and type t;

· σ(s, t), the standard deviations of the features for stratum s and type t;

· k*(s, t), the optimal number of clusters for stratum s and type t;

· M(s, t), the clustering model obtained with the optimal number of clusters for stratum s and type t.

The algorithm for applying the full clustering model is as follows. Let there be an outlet of type t belonging to stratum s, given by the feature vector x = (x_1, ..., x_m). From x we define the standardized vector z with elements

z_i = (x_i − μ_i(s, t)) / σ_i(s, t).

We apply the clustering model M(s, t) to the resulting vector. As a result, we obtain the cluster number j. Thus, the "cluster number" within the framework of the full clustering model consists of three parts:

· stratum;

· outlet type;

· cluster number according to the clustering model for the stratum and type (hereinafter, this number will be called simply the cluster number).
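The application of the full clustering model can be sketched as follows (the model store keyed by (stratum, type), the synthetic data and the "kiosk" type label are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_full_cluster(models, stratum, outlet_type, x):
    """Apply the full clustering model to one outlet.

    `models` maps (stratum, type) to (mu, sigma, fitted KMeans model);
    the returned triple is the three-part "cluster number".
    """
    mu, sigma, kmeans = models[(stratum, outlet_type)]
    z = (np.asarray(x, dtype=float) - mu) / sigma    # standardize with stored stats
    j = int(kmeans.predict(z.reshape(1, -1))[0])
    return (stratum, outlet_type, j)

# Tiny illustrative model store: one stratum/type with two obvious clusters
Z = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
km = KMeans(n_clusters=2, n_init=5, random_state=0).fit(Z)
models = {(1, "kiosk"): (np.zeros(2), np.ones(2), km)}
print(assign_full_cluster(models, 1, "kiosk", [0.05, 0.05]))
```

Storing μ and σ alongside each fitted model guarantees that new outlets are standardized with exactly the statistics of the data the model was trained on.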

3.4 Features for clustering

For clustering, it is necessary to compile a list of features that describe outlets. The following indicators were used to characterize outlets:

· Distances to places of population attraction (hereinafter MPN);

· Competitive environment: distance to transport infrastructure facilities and to other outlets of KA networks and non-KA networks (the distance to the nearest object and the number of objects within a radius of 1000 meters are determined);

· Solvency of the population in the vicinity of the outlet.

Formally, the features also include the stratum and type of outlet. However, clustering on these features is not carried out.

List of features for outlets:

1) income of the population (income);

2) average cost of 1 square meter of housing (sqm_price);

3) average cost of renting a one-room apartment (rent_price);

4) number of MPNs of any type within a radius of 1000 meters (number_in_radius_mpn_all);

5) number of non-KA-network outlets within a radius of 1000 meters (number_in_radius_tt);

6) number of KA-network outlets within a radius of 1000 meters (number_in_radius_ka);

7) number of railway stations within a radius of 1000 meters (number_in_radius_railway_station);

8) number of metro stations within a radius of 1000 meters (number_in_radius_subway_station);

9) number of ground public transport stops within a radius of 1000 meters (number_in_radius_city);

10) distance to the nearest MPN of arbitrary type (dist_to_closest_mpn);

11) distance to the nearest railway station (pts_railway_station_d01_distance);

12) distance to the nearest metro station (pts_subway_station_d01_distance);

13) distance to the nearest stop of ground public transport (pts_city_d01_distance);

14) distance to the nearest non-KA-network outlet (tt_to_tt_d001_distance);

15) distance to the nearest KA-network outlet (ka_d01_distance).

3.5 Identification of points that are homogeneous in location

As part of data preparation, all data were divided into strata homogeneous by population. This is necessary for high-quality clustering later on. The division into strata used the method of comparing averages, and the quality of the partition was checked by the degree of difference between the strata using nonparametric analysis of variance. The results are shown below:

1. Income of the population. The hypothesis of equal income across the 4 strata was rejected (see Table 1).

Table 1 Hypothesis about the income of the population


As can be seen from Figure 20, there is a noticeable difference in average income. In the first stratum it is significantly higher than in the others; the lowest income is noted in the fourth stratum.

Figure 20 Comparisons across strata (population income)

2. Average cost of one square meter of housing. The hypothesis of equal cost of 1 square meter of housing across the 4 strata was rejected (see Table 2).

Table 2. Hypothesis about the average cost of 1 square meter of housing


As can be seen from Figure 21, there is a noticeable difference in the average cost of 1 square meter of housing. In the first stratum it is significantly higher than in the others. The smallest value is in the second stratum; in strata 3 and 4 the cost is approximately the same.

Figure 21 Comparisons between strata (cost per square meter of housing)

3. Average cost of renting a one-room apartment. The hypothesis of equal rental costs across the 4 strata was rejected (see Table 3).

Table 3 Hypothesis about the average cost of rent


As can be seen from Figure 22, there is a noticeable difference in the average value of the cost of housing rental. In the first stratum, it is significantly higher than in the others. The smallest value is in the second stratum.

Figure 22 Comparisons across strata (average rental cost)

4. The number of MPNs of any type within a radius of 1000 meters. The hypothesis for the 4 strata was rejected (see Table 4).

Table 4. Hypothesis about the number of MPNs


As can be seen from Figure 23, there is a noticeable difference in the average value of the number of MPNs. In the first stratum, it is significantly higher than in the others. The smallest number of MPNs is in the fourth stratum.

Figure 23 Comparisons between strata (number of MPNs)

5. The number of non-KA-network outlets within a radius of 1000 meters. The hypothesis for the 4 strata was rejected (see Table 5).

Table 5 Hypothesis about the number of retail outlets of non-KA networks


As can be seen from Figure 24, there is a noticeable difference in the average values. In the second stratum, the average value is significantly higher than in the rest. The smallest value is in the fourth stratum.

Figure 24 Comparisons between strata (Number of non-KA TTs)

6. The number of KA-network outlets within a radius of 1000 meters. The hypothesis for the 4 strata was rejected (see Table 6).

Table 6 Hypothesis about the number of outlets of KA networks


As can be seen from Figure 25, there is a noticeable difference in the average values.

In the second stratum, the average value is higher than in the others, and the lowest in the fourth stratum.

Figure 25 Comparisons between strata (Number of TT KA networks)

7. The number of railway stations within a radius of 1000 meters. The hypothesis for the 4 strata was rejected (see Table 7).

Table 7 Hypothesis about the number of railway stations


As can be seen from Figure 26, there is a noticeable difference in the average values. In the first stratum the average value is higher than in the others; the smallest number of railway stations is in the third and fourth strata.

8. Number of ground public transport stops within a radius of 1000 meters. The hypothesis for 4 strata was rejected (see Table 8).

Table 8 Hypothesis on the number of ground transport stops


As can be seen from Figure 27, there is a noticeable difference in the average values. In the first stratum, the average value is higher than in the others, the lowest value is in the 4th stratum.

Figure 27 Comparisons between strata (number of ground transport stops)

9. Distance to the nearest MPN of any type. The hypothesis for 4 strata was rejected (see Table 9).

Table 9 Hypothesis about the distance to the nearest MPN


As can be seen from Figure 28, there is a noticeable difference in the average values. In the fourth stratum, the average value is higher than in the others. The lowest value is noted in the first and second strata.

Figure 28 Comparisons between strata (distance to the nearest MPN)

10. Distance to the nearest railway station. The hypothesis for the 4 strata was rejected (see Table 10).

Table 10 Hypothesis about the distance to the nearest railway station


As can be seen from Figure 29, there is a noticeable difference in the average values. In the fourth stratum, the average value is higher than in the others. The smallest value is noted in the first stratum.

Figure 29 Comparisons between strata (distance to nearest railway station)

11. Distance to the nearest metro station. The hypothesis for the 4 strata was rejected (see Table 11).

Table 11 Hypothesis about the distance to the metro station


As can be seen from Figure 30, there is a noticeable difference in the average values. In the second, third and fourth strata the average value is higher, and the lowest value is noted in the first stratum.

Figure 30 Comparisons between strata (distance to nearest metro station)

12. Distance to the nearest ground public transport stop. The hypothesis for 4 strata was rejected (see Table 12).

Table 12 Hypothesis about the distance to the nearest ground transport stop


As can be seen from Figure 31, there is a noticeable difference in the average values. In the fourth stratum, the average value is higher, and the lowest value is noted in stratum 1.

Figure 31 Comparisons between strata (distance to nearest ground transportation stop)

13. Distance to the nearest non-KA-network outlet. The hypothesis for the 4 strata was rejected (see Table 13).

Table 13 Hypothesis about the distance to the nearest non-KA-network outlet


As can be seen from Figure 32, there is a noticeable difference in the average values. In the third stratum the average value is higher, and the lowest values are noted in the first and second strata.

Figure 32 Comparisons between strata (distance to the nearest non-KA-network outlet)

14. Distance to the nearest KA-network outlet.

Table 14 Hypothesis about the distance to the nearest retail outlet of the KA network


As can be seen from Figure 33, there is a noticeable difference in the average values. In the third stratum the average value is higher, and the lowest values are noted in the first and second strata.

Figure 33 Comparisons between strata (distance to the nearest KA-network outlet)

As a result, measures of similarity between the strata were obtained (see Table 15).

Table 15 Comparison between strata
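The per-feature hypothesis tests summarized above (Tables 1-14) can be sketched with the Kruskal-Wallis H test, a standard nonparametric analysis of variance; the income samples below are synthetic placeholders, while in the actual analysis each group holds the outlet-level values of one feature within one stratum.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic income samples (thousand rubles) for the four strata
strata_income = [
    rng.normal(36, 3, 200),  # stratum 1: highest income
    rng.normal(28, 3, 200),  # stratum 2
    rng.normal(26, 3, 200),  # stratum 3
    rng.normal(24, 3, 200),  # stratum 4: lowest income
]
stat, p_value = kruskal(*strata_income)
# A small p-value rejects the hypothesis of equal income across the strata
print(f"H = {stat:.1f}, p = {p_value:.2e}")
```

Because the test is rank-based, it does not require the normality assumptions of classical ANOVA, which suits the heavily skewed distance and count features used here.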

3.5.1 Final division into strata

As a result, a division into 4 strata was chosen, with satellite cities assigned to their main cities. The stratum (field pop_strata) is determined by the population of the locality in which the outlet is located.

· Stratum 1 - large cities with a population of more than 1 million people;

· Stratum 2 - cities with a population of more than 250 thousand and up to 1 million people;

· Stratum 3 - cities with a population of more than 100 thousand and less than 250 thousand people;

· Stratum 4 - cities with a population of less than 100 thousand people.
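The derivation of the pop_strata field can be sketched as follows (population is taken in thousands of inhabitants; satellite cities are assumed to have been mapped to their main city beforehand):

```python
def pop_stratum(population_thousands: float) -> int:
    """Assign a stratum from the population of the outlet's locality."""
    if population_thousands > 1000:
        return 1  # large cities, over 1 million
    if population_thousands > 250:
        return 2  # 250 thousand to 1 million
    if population_thousands > 100:
        return 3  # 100 to 250 thousand
    return 4      # under 100 thousand

print(pop_stratum(1200), pop_stratum(500), pop_stratum(150), pop_stratum(60))
# 1 2 3 4
```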

3.6 Clustering objects into homogeneous groups

To identify outlets with a similar location, we cluster the objects within each stratum. Before clustering, it is necessary to separate outlets that are more homogeneous in location. The variation index was used to assess clustering quality. As a result, 36,651 outlets were divided into 15 clusters (36,598 outlets) plus a 16th cluster of 53 anomalous outlets. By anomalous we mean outlets with very high sales.

The following 7 descriptive statistics were used to characterize the clusters:

· Minimum, the lowest sales value;

· 5th percentile;

· 25th percentile;

· Median, the point on the scale of measured sales values above and below which half of all measured values lie;

· 75th percentile;

· 95th percentile;

· Maximum, the highest sales value.
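Computing these seven statistics for a cluster's sales can be sketched as follows (the sales figures are a synthetic placeholder):

```python
import numpy as np

def sales_profile(sales):
    """The seven descriptive statistics used to characterize a cluster."""
    s = np.asarray(sales, dtype=float)
    return {
        "min": s.min(),
        "p05": np.percentile(s, 5),
        "p25": np.percentile(s, 25),
        "median": np.median(s),
        "p75": np.percentile(s, 75),
        "p95": np.percentile(s, 95),
        "max": s.max(),
    }

# Illustrative monthly sales (thousand rubles) of outlets in one cluster
profile = sales_profile([3.1, 4.2, 5.0, 6.3, 7.8, 9.9, 4.4, 5.6])
print(profile["median"])
```

Percentiles summarize the skewed sales distributions more robustly than the mean, which is why they are used in the cluster profiles below.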

Table 16 Final clustering

Table 16 clearly shows the final distribution of clusters within the strata. The largest number of outlets belongs to the fourth stratum, and the smallest to the third.

· Stratum 1. For the first stratum (4402 outlets), by applying the k-means method (Chapter 2, paragraph 2.3), an optimal division into 4 clusters by 15 features was obtained. The number of clusters was chosen based on the optimization of the Akaike criterion.

· 1st cluster - includes outlets located close to the centers of large cities or in shopping centers.

Cluster Profile: This cluster is characterized by a significant number of places of attraction of the population (MPN), a high concentration of shopping zones, and developed infrastructure.

Figure 34 Share of clusters in the first stratum

It makes up 61.5% of the total sales of the stratum. There are 2708 outlets in the cluster. Average monthly sales in retail outlets of this cluster are estimated in the range from 3 to 7 thousand rubles. The average income of the population is 34-36 thousand rubles, which is above average and ahead of most of the other clusters in this indicator.

The average cost of 1 square meter of housing will be 63 - 64 thousand rubles, which can be called an average. The average cost of renting a one-room apartment is estimated at 14-15 thousand rubles, which can also be described as an average figure in comparison with other clusters.

The number of places of attraction of the population of any type within a radius of 1000 meters is from 32 to 47 - an indicator above average, and the number of outlets of non-KA networks within a radius of 1000 meters is about 40 - 53, which is also an indicator above average. Sales points of KA-networks within a radius of 1000 meters are represented by an average of 10 units. The presence of railway stations within a radius of 1000 meters is estimated as no more than two.

This cluster is characterized by the complete absence of metro stations within a radius of 1000 meters. The number of ground public transport stops within a radius of 1000 meters is 13-20 units.

Geographic characteristics of the cluster: The distance to the nearest place of attraction of the population of arbitrary type is minimal - nearby. The distance to the nearest railway station can be characterized as high - far. There is no metro station in the area. The distance to the nearest ground public transport stop is low - nearby. The distance to the nearest non-KA-network outlet is minimal - nearby, and the distance to the nearest KA-network outlet is slightly greater but also small - close.

· 2nd cluster - these are residential (sleeping) areas of large cities.

Cluster Profile: Insignificant number of MPNs, low concentration of foot traffic and shopping areas.

The main quantitative and qualitative characteristics of the cluster: It makes up 12.2% of the outlets in the stratum. There are 539 outlets in the cluster. Average monthly sales are estimated in the range of 3,000 to 8,000 rubles. The average income of the population is approximately 34 thousand rubles, similar to the 1st and 3rd clusters of this stratum but higher than most clusters of other strata.

The average cost of 1 square meter of housing is 61 - 63 thousand rubles, and the average cost of renting a one-room apartment will be 14 - 15 thousand rubles, as in the first cluster. The number of places of attraction of the population of an arbitrary type within a radius of 1000 meters is 7-8 units, and the number of outlets of non-KA networks within a radius of 1000 meters is estimated at 24 to 43 units. The number of outlets of KA networks within a radius of 1000 meters will be 2. No more than two railway stations within a radius of 1000 meters. An important characteristic is the absence of metro stations within a radius of 1000 meters. The number of ground public transport stops within a radius of 1000 meters is on average 3-4.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is quite low - close. The distance to the nearest railway station is high - far. There are no metro stations at all. A characteristic that distinguishes this cluster from the 1st is the high distance to the nearest ground public transport stop - far. The distance to the nearest non-KA-network outlet is low - nearby, and the distance to the nearest KA-network outlet is high - far.

· 3rd cluster - the centers of large cities.

Cluster Profile: The highest values of the number of places of attraction of the population, of trade activity indicators, and of other signs of a high level of economic activity and foot traffic.

The main quantitative and qualitative characteristics of the cluster: It makes up 25.9% of the total number of outlets in the stratum. This cluster includes 1139 outlets. Average monthly sales range from 3.2 to 10 thousand rubles. The average income of the population is 36 thousand rubles, a fairly good indicator - above average.

The average cost of 1 square meter of housing is estimated at 63-68 thousand rubles, and the average cost of renting a one-room apartment is approximately 14-15 thousand rubles, which does not differ from the 1st and 2nd clusters. The number of MPNs within a radius of 1000 meters is high, 51-66 units, and there are 46-55 non-KA-network outlets within a radius of 1000 meters, which is also a high figure.

The number of outlets of KA networks within a radius of 1000 meters is 15 - a lot. The presence of railway stations within a radius of 1000 meters is approximately one or two. The number of metro stations within a radius of 1000 meters is on average one, but no more than 3. The number of ground public transport stops within a radius of 1000 meters is 20-30 units, which is a very high figure.

Geographic characteristics of the cluster : The distance to the nearest MPN of any type is not high - nearby. The distance from the nearest railway station is also not high, the characteristic is close. The distance to the nearest metro station is low - close.

The nearest ground public transport stop is a very short distance - nearby. Low distance to the nearest non-KA-network outlet - nearby. The distance from the nearest KA-network outlet is also very low - nearby.

· 4th cluster - these are residential, expensive areas and private properties remote from the center.

Cluster Profile: The highest values of the cost characteristics (income, real estate) and the lowest values of the number of MPNs and trade indicators. It makes up only 0.4% of all outlets in the stratum.

The main quantitative and qualitative characteristics of the cluster : The cluster includes only 16 outlets and is the smallest of all clusters in the stratum. Sales per month range from 4 to 40 thousand rubles. The average monthly income of the population is 49-66 thousand rubles, which is a very high figure. The average cost of 1 square meter of housing is also very high and is estimated at 85 - 124 thousand rubles. The average cost of renting a one-room apartment is higher than in other clusters of this stratum and amounts to 21-34 thousand rubles. The number of MPNs of any type within a radius of 1000 meters is low - from 4 to 20. There are no outlets of non-KA networks within a radius of 1000 meters nearby. The number of outlets of KA-networks within a radius of 1000 meters is 2. The presence of railway stations within a radius of 1000 meters - no more than one. There are no more than two metro stations within a radius of 1000 meters. The number of ground public transport stops within a radius of 1000 meters is only one.

Geographic characteristics of the cluster : The distance to the nearest MPN of any type is low - close. The distance from the nearest railway station is high - far. There are no metro stations nearby. The distance to the nearest stop of ground public transport is high, the characteristic is far. The distance from the nearest non-KA-network point of sale is very high - far away. This cluster characterizes the absence of KA-network outlets - none nearby.

· Stratum 2. For the second stratum (9269 outlets), by applying the k-means method (Chapter 2, paragraph 2.3), an optimal division into 5 clusters by 15 features was obtained. The number of clusters was chosen based on the optimization of the Akaike criterion.

Figure 35 Share of clusters in the second stratum

· 5th cluster - these are the outskirts of cities, small settlements.

Cluster Profile: Average values of infrastructure development indicators (railway stations and stops are present). Trading activity comes only from some non-KA networks. The lowest values of economic activity indicators in the stratum.

The main quantitative and qualitative characteristics of the cluster: It makes up 10% of the total number of outlets in the stratum. This cluster includes 892 outlets. Average monthly sales are estimated in the range of 2.4 to 6 thousand rubles. The income of the population averages 27 thousand rubles, a low figure compared with the clusters of the first stratum.

The average cost of 1 square meter of housing fluctuates around 47-53 thousand rubles, which is also lower than the indicators of stratum 1. The average cost of renting a one-room apartment is 12 thousand rubles. The number of MPN of any type within a radius of 1000 meters is from 2 to 5 pieces. The presence of non-ka-network outlets within a radius of 1000 meters is 9-30 pieces. The complete absence of outlets of ka-networks within a radius of 1000 meters - none nearby. The number of railway stations within a radius of 1000 meters is no more than two pieces. Ground public transport stops within a radius of 1000 meters - an average of two pieces.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is low - nearby. The distance to the nearest railway station is high - far. The distance to the nearest ground public transport stop is also high - far. The distance to the nearest non-KA-network outlet is insignificant - close, while the distance to the nearest KA-network outlet is large - far.

· 6th cluster - these are residential, sleeping areas of cities.

Cluster Profile : Average indicators of trading activity due to non-ka-networks and indicators of economic activity due to closely spaced MPNs;

The main quantitative and qualitative characteristics of the cluster: The cluster makes up 15% of the total number of outlets in the stratum and includes 1345 outlets. Monthly sales are estimated at 3-6 thousand rubles. The average income of the population is 26 thousand rubles, average for this stratum. The average cost of 1 square meter of housing is 53 thousand rubles, and the average cost of renting a one-room apartment is 12 thousand rubles, as in the previous cluster. The number of MPNs of any type within a radius of 1000 meters is 18-25, and non-KA-network outlets within a radius of 1000 meters range from 30 to 44. The number of KA-network outlets within a radius of 1000 meters averages 6-9, a high figure. There are no more than two railway stations within a radius of 1000 meters. There are no ground public transport stops at all within a radius of 1000 meters.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is low - nearby, and the nearest railway station is close. The distance to the nearest ground public transport stop is high - far. The nearest non-KA-network outlet is not close, nor is the nearest KA-network outlet.

· 7th cluster - these are areas close to city centers, near highways.

Cluster Profile : High indicators of trading activity and infrastructure development (land transport stops), average indicators of MPN.

The main quantitative and qualitative characteristics of the cluster: It makes up 34% of the total number of outlets in the stratum. This cluster includes 3194 outlets and is the largest in the stratum, along with the 8th cluster.

Monthly sales are estimated in the range of 2 to 6 thousand rubles.

The average income of the population is 28 thousand rubles.

The average cost of 1 square meter of housing is 42-49 thousand rubles, lower than the corresponding figures in the 5th and 6th clusters.

The average cost of renting a one-room apartment practically does not differ from the previously considered clusters of this stratum and amounts to 11-12 thousand rubles.

The number of MPNs of arbitrary type within a radius of 1000 meters is 21-33, and the number of non-ka-network outlets within a radius of 1000 meters is about 50. The number of ka-network outlets within a radius of 1000 meters is on average 7-10. There are no railway stations within a radius of 1000 meters.

There are about 14 ground public transport stops within a radius of 1000 meters.

Geographic characteristics of the cluster : Low distance to the nearest MPN of any type, high distance to the nearest railway station. Not far from the nearest surface public transport stop. The distance to the nearest non-ka-network outlet is low, the characteristic is nearby. It is also close to the nearest ka-network outlet.

· 8th cluster - these are the centers of small towns (~500 thousand people).

Cluster Profile : Significant number of MPNs, high concentration of trade zones, low infrastructure indicators.

The main quantitative and qualitative characteristics of the cluster: It makes up 34% of the total number of outlets in the stratum. This cluster includes 3191 outlets and is the largest in the stratum, along with the 7th cluster. The average sales data for the month is 3-8 thousand rubles. The average monthly income of the population is estimated at 28 thousand rubles. The average cost of 1 square meter of housing is 47 - 50 thousand rubles, and the average cost of renting a one-room apartment is 12 thousand rubles. The number of MPN of any type within a radius of 1000 meters is on average 28-40 pieces, the presence of retail outlets of non-ka-networks within a radius of 1000 meters - from 38 to 52 pieces. Availability of outlets of ka-networks within a radius of 1000 meters - from 7 to 11 units. There are no railway stations within a radius of 1000 meters. The number of ground public transport stops within a radius of 1000 meters is very low, there are almost none.

Geographic characteristics of the cluster: The nearest MPN of any type is nearby. The distance to the nearest railway station is high - far. The distance to the nearest ground public transport stop is also high - far. The nearest non-KA-network outlet is close, and the distance to the nearest KA-network outlet is also small - close.

· 9th cluster - these are the centers of cities, with a population of up to 1 million people.

Cluster Profile: The highest values of economic and trading activity indicators in the stratum.

The main quantitative and qualitative characteristics of the cluster : It makes up 7% of the total number of retail outlets in the stratum. This cluster includes 647 retail outlets and is the smallest in the stratum. Monthly sales are 6-8 thousand rubles and this is higher than similar indicators for other clusters of this stratum. The income of the population, as in other clusters of the stratum, is estimated at 28 thousand rubles. The average cost of 1 square meter of housing is 50-53 thousand rubles. The average cost of renting a one-room apartment also does not differ from similar indicators in other clusters of the stratum and is equal to 12 thousand rubles.

The number of MPNs of arbitrary type within a radius of 1000 meters is 90 pieces and is a very high indicator, and non-ka-network outlets within a radius of 1000 meters - 155 pieces, which can also be called a very high indicator. The number of outlets of ka-networks within a radius of 1000 meters is 20-21 units. There are no railway stations within a radius of 1000 meters.

The number of ground public transport stops within a radius of 1000 meters is about 15-18.

Geographic characteristics of the cluster : The nearest MPN of any type is nearby, and the nearest railway station is far away. Close to the nearest ground public transport stop. The distance to the nearest non-ka-network outlet is low, it is nearby, and the nearest ka-network outlet is also close.

· Stratum 3. For the third stratum (1958 outlets), by applying the k-means method (Chapter 2, paragraph 2.3), an optimal partition into 2 clusters according to 13 features was obtained, since this stratum has no outlets close to the metro. The number of clusters was chosen based on the optimization of the Akaike criterion.

Figure 36 Share of clusters in the third stratum

· 10th cluster - these are remote areas and smaller cities.

Cluster Profile : Low economic activity, the average degree of trading activity.

The main quantitative and qualitative characteristics of the cluster: It makes up 55% of the total number of retail outlets in the stratum. This cluster includes 1084 retail outlets. The income of the population is estimated at 24 thousand rubles, which is lower than the indicators of the 1st and 2nd strata. Average monthly sales are estimated at 18 thousand rubles, which is significantly higher than the indicators of the 1st and 2nd strata. The cluster is characterized by the absence of MPNs of any type within a radius of 1000 meters. The number of non-ka-network outlets within a radius of 1000 meters is from 15 to 40. There are 3 ka-network outlets within a radius of 1000 meters. As a rule, there are no railway stations within a radius of 1000 meters. As for ground public transport stops within a radius of 1000 meters, 75% of the points have none; the remaining 25% have up to 20.

Geographic characteristics of the cluster: There are no MPNs of any type nearby, nor railway stations or ground public transport stops. The distance to the nearest non-ka-network outlet is small, and the nearest ka-network outlet is also close.

· 11th cluster - centers of small towns, shopping areas.

Cluster profile: Significant degree of economic and trade activity.

As a rule, there are no railway stations within a radius of 1000 meters.

The number of ground public transport stops within a radius of 1000 meters: 75% of outlets have none; the remaining 25% have up to 22.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is small, while there are no railway stations or ground public transport stops nearby. The distance to the nearest non-ka-network outlet is small, such outlets are nearby. The distance to the nearest ka-network outlet is also small.

Fourth stratum. For the fourth stratum (20,969 outlets), applying the k-means method (Chapter 2, paragraph 2.3) gave an optimal partition into 4 clusters over 12 features, since this stratum has no outlets close to the transport infrastructure. The number of clusters was chosen by optimizing the Akaike criterion.
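The Akaike criterion mentioned here can be illustrated with a short sketch. The thesis does not give its exact formula, so the AIC-style score below (n·ln(SSE/n) + 2·k·dim) is only one common heuristic assumption for k-means model selection, and the SSE values in the usage example are illustrative:

```python
import math

def aic_score(sse, n, k, dim):
    """AIC-style score for a k-means partition: lower is better.
    sse - within-cluster sum of squared errors of the partition,
    n   - number of objects, k - number of clusters, dim - number of features.
    The form n*ln(SSE/n) + 2*k*dim is an assumed heuristic, not the
    thesis's exact criterion."""
    return n * math.log(sse / n) + 2 * k * dim

# Comparing candidate partitions: a good 2-cluster split of two tight,
# well-separated groups (tiny SSE) should beat lumping them together.
score_k1 = aic_score(sse=500.0, n=10, k=1, dim=2)
score_k2 = aic_score(sse=0.08, n=10, k=2, dim=2)
best_k = 1 if score_k1 < score_k2 else 2
```

The number of clusters is then chosen as the k with the smallest score.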

Figure 37 Share of clusters in the fourth stratum

· 12th cluster - outskirts of small towns.

Cluster Profile: The lowest income indicators, no transport infrastructure, only a few shops.

The main quantitative and qualitative characteristics of the cluster: It makes up 37% of the total number of retail outlets in the stratum. This cluster includes 7682 retail outlets. The income of the population is estimated at 18-20 thousand rubles, which is significantly lower than similar indicators in other strata.

Monthly sales are 19-35 thousand rubles. There are no MPNs of any type within a radius of 1000 meters. The number of non-ka-network outlets within a radius of 1000 meters is 3-8. There are no ka-network outlets within a radius of 1000 meters, nor railway stations or ground public transport stops. The distance to the nearest MPN of any type is large, it is far away, as are the nearest railway station and the nearest ground public transport stop. The nearest non-ka-network outlet is not close, and the nearest ka-network outlet is far away.

· 13th cluster - shopping areas of small towns

Cluster Profile: Average trading activity, weakly developed transport infrastructure.

The main quantitative and qualitative characteristics of the cluster: It makes up 31% of the total number of retail outlets in the stratum. This cluster includes 6,514 retail outlets. The income of the population is estimated at 21-24 thousand rubles, which is significantly lower than that of other strata, but higher than the indicator of the 12th cluster of this stratum.

Monthly sales amount to 21-46 thousand rubles. There are no MPNs of any type within a radius of 1000 meters. There are no railway stations within a radius of 1000 meters.

Most outlets have no ground public transport stops within a radius of 1000 meters; some have up to 3.

Geographic characteristics of the cluster: The nearest MPN of any type is far away, as are the nearest railway station and the nearest ground public transport stop. The nearest non-ka-network outlet is nearby. The distance to the nearest ka-network outlet is small, it is not far (up to 1 km).

· 14th cluster - small settlements with the lowest degree of trading activity

Cluster Profile: The lowest trading activity, with a minimal set of stores. Average income level of the population.

The main quantitative and qualitative characteristics of the cluster: It makes up 20% of the total number of retail outlets in the stratum. This cluster includes 4188 retail outlets. The income of the population is estimated at 24-26 thousand rubles, which is significantly lower than similar indicators for other strata, but higher than the indicators of the 12th and 13th clusters of this stratum. Monthly sales are 21-38 thousand rubles.

A complete absence of MPNs of any type within a radius of 1000 meters.

The number of non-ka-network outlets within a radius of 1000 meters is from 1 to 4, and there are no ka-network outlets within that radius. There are no railway stations within a radius of 1000 meters, nor ground public transport stops.

Geographic characteristics of the cluster: The nearest MPN of any type is far away, as are the nearest railway station and the nearest ground public transport stop. The nearest ka-network outlet is also far away.

· 15th cluster - economically active settlements with less than 100 thousand people.

Cluster Profile : The only cluster where there are signs of economic activity in the stratum. The highest rates of trading activity.

The main quantitative and qualitative characteristics of the cluster: It makes up 12% of the total number of retail outlets in the stratum. This cluster includes 2,585 retail outlets. The income of the population is 25-28 thousand rubles, which is significantly lower than in other strata, but higher than in the other clusters of this stratum. Monthly sales are 24-52 thousand rubles, the highest figure among all strata.

There are 2-7 MPNs of any type within a radius of 1000 meters. The number of non-ka-network outlets within a radius of 1000 meters is from 14 to 28, and there are 1 to 4 ka-network outlets. There are no railway stations within a radius of 1000 meters. Most outlets have no ground public transport stops within this radius; some have up to 7.

Geographic characteristics of the cluster: The nearest MPN of any type is close, while the nearest railway station and the nearest ground public transport stop are far away. The distance to the nearest non-ka-network outlet is small, such outlets are nearby. For half of the outlets, the nearest ka-network outlet is within 500 m; for the rest it is far away.

3.7 Clustering the assortment of outlets

Figure 38 Number of TTs with a grouped assortment

By applying the two-step cluster analysis method, the assortment of outlets was divided into 5 clusters. The silhouette measure is 0.2, which indicates average clustering quality. The sizes of the clusters can be seen in the figure below. The largest is the first cluster, which accounts for almost 59% (17,622 outlets). The smallest is cluster 5, at almost 2%, or 452 outlets. The difference from the clustering of retail outlets is that here the products were divided so as to differ as much as possible between groups, whereas the TTs were combined according to the principle of similarity.
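The silhouette measure reported above can be computed without any special software; a minimal pure-Python sketch (the point coordinates and labels below are illustrative):

```python
def silhouette(points, labels):
    """Mean silhouette coefficient: s = (b - a) / max(a, b), where a is the
    mean distance to points of the own cluster and b is the smallest mean
    distance to another cluster."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own if q is not p) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in grp) / len(grp)
                for m, grp in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated groups give a silhouette close to 1;
# a value around 0.2, as in the study, indicates only average separation.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])
```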

Table 17. Share of each cluster


Figure 39 Breadth of assortment in each cluster

· First cluster - the assortment group with the smallest selection: sweets and chocolate bars in small packages. Such goods are most likely sold at gas stations or in small kiosks. The five best-selling products in this cluster are: Babaevsky bitter chocolate 100 grams, Alenka chocolate 15 grams, Alenka chocolate 100 grams, Khoroshaya Kompaniya confectionery bar with wafer crumbs 80 grams, and Khoroshaya Kompaniya chocolate bar with peanuts 80 grams.

· Second cluster - a group of goods with an average assortment choice, typical of stores in cities with a population of more than 250 thousand people. The five best-selling products in this cluster are: Khoroshaya Kompaniya confectionery bar with wafer crumbs 80 grams, Alenka chocolate 20 grams, Alenka "lots of milk" chocolate 100 grams, Khoroshaya Kompaniya chocolate bar with peanuts 80 grams, and Alenka milk chocolate with multi-colored dragees.

· Third cluster - a group with a small selection of products, mainly chocolate products and waffle cakes. Shops in small towns or villages fall into this category. The five best-selling products in this cluster are: Alenka chocolate 100 grams, Alenka chocolate 15 grams, Alenka chocolate 20 grams, Moskvichka caramel, and Babaevsky bitter chocolate 100 grams.

· Fourth cluster - a cluster with a large assortment selection. This group of goods belongs to large branded confectionery stores in large cities. The five best-selling products in this cluster are: Alenka chocolate 100 grams, Moskvichka caramel, Babaevsky bitter chocolate 100 grams, Korovka wafers with baked milk flavor, and Romashka candy.

· Fifth cluster - the cluster with the largest assortment selection. This group of goods belongs to large branded confectionery stores in satellite cities. The five best-selling products in this cluster are: Ptichye Moloko sweets, Moskvichka caramel, Alenka chocolate 100 grams, Babaevsky bitter chocolate 100 grams, and Korovka wafers with baked milk flavor.

It can be concluded that the most popular product is Alenka chocolate: it leads the sales ranking in every cluster.

Conclusion to the third chapter

The studies carried out using the cluster analysis method helped to divide the outlets into strata by location; each stratum was then divided into clusters. As a result, this cluster analysis reduced heterogeneity by a factor of 1.77. Relationships between socio-demographic indicators (gender, age, income) and consumer behavior were analyzed and identified. In addition, a clustering of the assortment of retail outlets was carried out, which revealed that the smallest assortment is presented in the cluster with the largest number of outlets.

Conclusion

Big data is not just another hype in the IT market; it is a systematic, qualitative transition to building knowledge-based value chains. Its effect can be compared with the appearance of affordable computer technology at the end of the last century. While short-sighted conservatives continue to apply deeply outdated approaches, enterprises that already use Big Data technologies will take the leading positions and gain competitive advantages in the market. There is no doubt that all major organizations will adopt this technology in the coming years, as it is both the present and the future.

This graduate work presents a scientific, systematic approach to choosing the location of outlets, and the methods of obtaining and analyzing the information needed for the end result are very inexpensive, allowing even individual entrepreneurs with a small turnover to carry out such a procedure.

Given the growth in the rate of accumulation of information, there is an urgent need for data analysis technologies, which, in this regard, are also rapidly developing. The development of these technologies in recent years has made it possible to move from segmenting customers into groups with similar preferences to building models in real time, based, among other things, on customer requests on the Internet and visits to certain pages. It becomes realistic to display specific offers and advertisements based on the analysis of consumer interests, making these offers much more targeted. It is also possible to correct and reconfigure the model in real time.

Cluster analysis can truly be called the most convenient and optimal tool for identifying market segments. The use of these methods has become especially relevant in the age of high technology, in which it is so important to speed up labor-intensive and lengthy processes. The variables used as the basis for clustering should be chosen on the basis of previous studies, theoretical background, tested hypotheses, and the judgment of the researcher. In addition, an appropriate measure of similarity should be taken. A distinctive feature of hierarchical clustering is the development of a hierarchical structure. The most common and effective variance-based method is Ward's method. Non-hierarchical clustering methods are often referred to as k-means methods. The choice of the clustering method and the choice of the distance measure are interrelated. In hierarchical clustering, an important criterion for deciding on the number of clusters is the distance at which clusters are combined. Cluster sizes should be such that it makes sense to keep a cluster rather than merge it with others. The reliability and validity of clustering solutions are evaluated in different ways.

The studies carried out using the cluster analysis method helped to divide the outlets into strata by location; each stratum was then divided into clusters. As a result, this cluster analysis reduced heterogeneity by a factor of 1.77. Relationships between socio-demographic indicators (gender, age, income) and consumer behavior were analyzed and identified. In addition, a clustering of the assortment of retail outlets was carried out, which revealed that the smallest assortment is presented in the cluster with the largest number of outlets.

Bibliography

1. StatSoft. Electronic Textbook on Statistics.

2. Mandel I. D. Cluster Analysis. M., 1988.

3. Paklin N. "Data Clustering: A Scalable CLOPE Algorithm."

4. Oldenderfer M. S., Blashfield R. K. Cluster Analysis // Factor, Discriminant and Cluster Analysis: transl. from English, ed. I. S. Enyukov. M.: Finance and Statistics, 1989. 215 p.

5. Fasulo D. "An Analysis of Recent Work on Clustering Algorithms."

6. Duran B., Odell P. Cluster Analysis. M.: Statistics, 1977.

7. Jambu M. Hierarchical Cluster Analysis and Correspondences. M., 1988.

8. Khaidukov D. S. Application of Cluster Analysis in Public Administration // Philosophy of Mathematics: Actual Problems. M.: MAKS Press, 2009. 287 p.

9. Classification and Clustering / ed. J. Van Ryzin. M.: Mir, 1980.

10. Tryon R. C. Cluster Analysis. London, 1939. 139 p.

11. Berikov V. S., Lbov G. S. Modern Tendencies in Cluster Analysis. 2008. 67 p.

12. Vyatchenin D. A. Fuzzy Methods of Automatic Classification. Minsk: Technoprint, 2004. 320 p.

13. Chubukova I. A. Data Mining: A Tutorial. M.: Internet University of Information Technologies.

14. Paklin N. "Categorical Data Clustering: A Scalable CLOPE Algorithm."

15. Guha S., Rastogi R., Shim K. "CURE: An Efficient Clustering Algorithm for Large Databases." Electronic edition.

16. Zhang T., Ramakrishnan R., Livny M. "BIRCH: An Efficient Data Clustering Method for Very Large Databases."

17. Paklin N. "Clustering Algorithms in the Service of Data Mining."

18. Janson J. "Modeling."

19. Chubukova I. A. Data Mining: A Textbook. 2006.

20. Maheshwari A. Data Analytics Made Accessible.

21. Cukier K. "Big Data: A Revolution That Will Change the Way We Live, Work and Think."

22. O'Neil C., Schutt R. "Data Science."

Cluster Analysis in the Problems of Socio-Economic Forecasting

Introduction to cluster analysis.

When analyzing and forecasting socio-economic phenomena, the researcher often encounters the multidimensionality of their description. This happens when solving the problem of market segmentation, building a typology of countries according to a sufficiently large number of indicators, predicting market conditions individual goods, studying and predicting economic depression and many other problems.

Methods of multivariate analysis are the most effective quantitative tool for studying socio-economic processes described by a large number of characteristics. These include cluster analysis, taxonomy, pattern recognition, and factor analysis.

Cluster analysis most clearly reflects the features of multivariate analysis in classification; factor analysis, in the study of relationships.

Sometimes the cluster analysis approach is referred to in the literature as numerical taxonomy, numerical classification, self-learning recognition, etc.

Cluster analysis found its first application in sociology. The name comes from the English word cluster - bunch, accumulation. The subject of cluster analysis was first defined and described in 1939 by the researcher Tryon. The main purpose of cluster analysis is to divide the set of objects and features under study into groups, or clusters, that are homogeneous in an appropriate sense. This means solving the problem of classifying the data and identifying the corresponding structure in it. Cluster analysis methods can be applied in a wide variety of cases, even for simple grouping, in which everything comes down to forming groups by quantitative similarity.

The great advantage of cluster analysis is that it allows you to partition objects not by one parameter, but by a whole set of features. In addition, cluster analysis, unlike most mathematical and statistical methods, does not impose any restrictions on the type of objects under consideration, and allows us to consider a set of initial data of an almost arbitrary nature. This is of great importance, for example, for conjuncture forecasting, when indicators have a variety of forms that make it difficult to use traditional econometric approaches.

Cluster analysis makes it possible to consider a sufficiently large amount of information and drastically reduce, compress large arrays of socio-economic information, making them compact and visual.

Cluster analysis is of great importance when applied to sets of time series characterizing economic development (for example, general economic and commodity conditions). Here one can single out periods when the values of the corresponding indicators were quite close, and determine groups of time series whose dynamics are most similar.

Cluster analysis can be used cyclically. In this case, the study is carried out until the desired results are achieved. At the same time, each cycle here can provide information that can greatly change the direction and approaches of further application of cluster analysis. This process can be represented as a feedback system.

In the problems of socio-economic forecasting, it is very promising to combine cluster analysis with other quantitative methods (for example, with regression analysis).

Like any other method, cluster analysis has certain disadvantages and limitations: In particular, the composition and number of clusters depends on the selected partitioning criteria. When reducing the initial data array to a more compact form, certain distortions may occur, and the individual features of individual objects may also be lost due to their replacement by the characteristics of the generalized values ​​of the cluster parameters. When classifying objects, very often the possibility of the absence of any cluster values ​​in the considered set is ignored.

In cluster analysis, it is considered that:

a) the selected characteristics allow, in principle, the desired clustering;

b) the units of measurement (scale) are chosen correctly.

The choice of scale plays a big role. Typically, data is normalized by subtracting the mean and dividing by the standard deviation so that the variance is equal to one.
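The normalization described here, z-score standardization, can be sketched in a few lines:

```python
def standardize(values):
    """z-score: subtract the mean and divide by the (population) standard
    deviation, so the result has mean 0 and variance 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

z = standardize([2.0, 4.0, 6.0])
```

In practice each feature column is standardized separately before computing distances, so that no single feature dominates merely because of its units.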

The problem of cluster analysis.

The task of cluster analysis is to split the set of objects G into m (m is an integer) clusters (subsets) Q1, Q2, ..., Qm, based on the data contained in the set X, so that each object Gj belongs to one and only one partition subset and that the objects belonging to the same cluster are similar, while the objects belonging to different clusters are heterogeneous.

For example, let G include n countries, each characterized by GNP per capita (F1), the number of cars per 1,000 people (F2), per capita electricity consumption (F3), per capita steel consumption (F4), etc. Then X1 (the measurement vector) is the set of these characteristics for the first country, X2 for the second, X3 for the third, and so on. The task is to partition the countries by level of development.

The solution to the problem of cluster analysis consists of partitions that satisfy some optimality criterion. This criterion can be a functional expressing the desirability of different partitions and groupings, called the objective function. For example, the intragroup sum of squared deviations can be taken as the objective function:

$$W = \sum_{j=1}^{n} \left( x_j - \bar{x} \right)^2,$$

where $x_j$ represents the measurements of the j-th object and $\bar{x}$ is the mean of the cluster containing it.
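This objective function can be computed directly from a partition; a small sketch with illustrative points:

```python
def within_cluster_ss(clusters):
    """Objective function W: sum over all clusters of squared Euclidean
    distances from each point to the mean of its cluster."""
    total = 0.0
    for pts in clusters:
        mean = tuple(sum(col) / len(pts) for col in zip(*pts))
        total += sum(sum((a - m) ** 2 for a, m in zip(p, mean)) for p in pts)
    return total

# Two points at distance 2 sit 1 away from their mean, contributing 1 + 1 = 2.
w = within_cluster_ss([[(0.0, 0.0), (2.0, 0.0)]])
```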

To solve the problem of cluster analysis, it is necessary to define the concept of similarity and heterogeneity.

It is clear that the i-th and j-th objects should fall into the same cluster when the distance between the points Xi and Xj is small enough, and into different clusters when this distance is large enough. Thus, whether objects fall into one cluster or into different ones is determined by the distance between Xi and Xj in Ep, where Ep is the p-dimensional Euclidean space. A non-negative function d(Xi, Xj) is called a distance function (metric) if:

a) d(Xi, Xj) ≥ 0 for all Xi and Xj from Ep;

b) d(Xi, Xj) = 0 if and only if Xi = Xj;

c) d(Xi, Xj) = d(Xj, Xi);

d) d(Xi, Xj) ≤ d(Xi, Xk) + d(Xk, Xj), where Xi, Xj and Xk are any three vectors from Ep.

The value d(Xi, Xj) for Xi and Xj is called the distance between Xi and Xj and is equivalent to the distance between Gi and Gj according to the selected characteristics (F1, F2, F3, ..., Fp).

The most commonly used distance functions are:

1. Euclidean distance: $d_2(X_i, X_j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}$

2. $l_1$-norm: $d_1(X_i, X_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$

3. Supremum norm: $d_\infty(X_i, X_j) = \sup_{1 \le k \le p} |x_{ik} - x_{jk}|$

4. $l_p$-norm: $d_p(X_i, X_j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^p \right)^{1/p}$

The Euclidean metric is the most popular. The l1 metric is the easiest to calculate. The supremum norm is easy to calculate and includes an ordering procedure, while the lp-norm covers distance functions 1, 2 and 3.
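The four distance functions take only a few lines of Python (a sketch; vectors are plain tuples):

```python
def euclidean(x, y):
    """d2: square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def l1_norm(x, y):
    """d1: sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def sup_norm(x, y):
    """d-infinity: largest absolute coordinate difference (Chebyshev)."""
    return max(abs(a - b) for a, b in zip(x, y))

def lp_norm(x, y, p):
    """dp: generalizes d1 (p = 1) and d2 (p = 2)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

For x = (0, 0) and y = (3, 4) these give 5, 7, 4 and, for p = 2, again 5.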

Let the n measurements X1, X2, ..., Xn be represented as a p × n data matrix:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pn} \end{pmatrix}$$

Then the distances between pairs of vectors d(Xi, Xj) can be represented as a symmetric n × n distance matrix:

$$D = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix}$$

The concept opposite to distance is that of similarity between objects Gi and Gj. A non-negative real function S(Xi, Xj) = Sij is called a similarity measure if:

a) 0 ≤ S(Xi, Xj) < 1 for Xi ≠ Xj;

b) S(Xi, Xi) = 1;

c) S(Xi, Xj) = S(Xj, Xi).

The value Sij is called the similarity coefficient.

1.3. Methods of cluster analysis.

Today there are many methods of cluster analysis. Let us dwell on some of them (the methods given below are usually called the methods of minimum variance).

Let X be the observation matrix X = (X1, X2, ..., Xn), and let the square of the Euclidean distance between Xi and Xj be given by the formula:

$$d^2(X_i, X_j) = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2$$

1) Complete linkage method.

The essence of this method is that two objects belonging to the same group (cluster) have a similarity coefficient not below some threshold value S. In terms of the Euclidean distance d, this means that the distance between any two points (objects) of a cluster must not exceed some threshold value h. Thus, h defines the maximum allowable diameter of a subset forming a cluster.
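The threshold condition just described, every pairwise distance within a cluster at most h, amounts to checking the cluster's diameter; a sketch with illustrative points:

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def cluster_diameter(pts, dist=euclidean):
    """Largest pairwise distance between points of the cluster."""
    if len(pts) < 2:
        return 0.0
    return max(dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:])

def satisfies_complete_link(pts, h, dist=euclidean):
    """Complete-linkage condition: the cluster diameter does not exceed h."""
    return cluster_diameter(pts, dist) <= h

pts = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]  # diameter 5.0
```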

2) Method of maximum local distance.

Each object is considered as a one-point cluster. Objects are grouped according to the following rule: two clusters are combined if the maximum distance between the points of one cluster and the points of another is minimal. The procedure consists of n - 1 steps and results in partitions that match all possible partitions in the previous method for any threshold values.

3) Ward's method.

In this method, the intragroup sum of squared deviations is used as an objective function, which is nothing more than the sum of the squared distances between each point (object) and the average for the cluster containing this object. At each step, two clusters are combined that lead to the minimum increase in the objective function, i.e. intragroup sum of squares. This method is aimed at combining closely spaced clusters.
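The increase in the intragroup sum of squares when two clusters are merged has a closed form (the Lance-Williams expression for Ward's method): n1·n2/(n1+n2) times the squared distance between the centroids. A sketch:

```python
def centroid(pts):
    """Coordinate-wise mean of the cluster's points."""
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def ward_increase(pts1, pts2):
    """Growth of the within-cluster sum of squares caused by merging the
    two clusters: n1*n2/(n1+n2) * squared centroid distance."""
    c1, c2 = centroid(pts1), centroid(pts2)
    d2 = sum((a - b) ** 2 for a, b in zip(c1, c2))
    n1, n2 = len(pts1), len(pts2)
    return n1 * n2 / (n1 + n2) * d2

# At each step Ward's method merges the pair of clusters with the
# smallest ward_increase value.
```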

In STATISTICA, the classical methods of cluster analysis are implemented, including k-means, hierarchical clustering, and two-way joining.

Data can come both in its original form and in the form of a matrix of distances between objects.

Observations and variables can be clustered using various distance measures (Euclidean, Euclidean squared, Manhattan, Chebyshev, etc.) and various clustering rules (single, full connection, unweighted and weighted pairwise group averages, etc.).

Formulation of the problem

The original data file contains the following information about vehicles and their owners:

The purpose of this analysis is to divide cars and their owners into classes, each of which corresponds to a certain risk group. Observations that fall into one group are characterized by the same probability of an insured event, which is subsequently assessed by the insurer.

Using cluster analysis to solve this problem is most effective. In the general case, cluster analysis is designed to combine some objects into classes (clusters) in such a way that the most similar ones fall into one class, and objects of different classes differ as much as possible from each other. The similarity score is calculated in a predetermined manner based on the data characterizing the objects.

Measurement scale

All cluster algorithms need to estimate the distances between clusters or objects, and it is clear that when calculating the distance, it is necessary to specify the measurement scale.

Because different measurements use completely different types of scales, the data must be standardized (in the Data menu, select the Standardize item) so that each variable has a mean of 0 and a standard deviation of 1.

The table with standardized variables is shown below.

Step 1. Hierarchical classification

At the first stage, we will find out whether the cars form "natural" clusters that can be understood.

Choose Cluster Analysis in the Analysis - Multivariate Exploratory Analysis menu to display the start panel of the cluster analysis module. In this dialog, choose Hierarchical Classification and press OK.

Press the Variables button and choose All; in the Objects field, choose Observations (rows). As the amalgamation rule, select the complete linkage method, and as the proximity measure, Euclidean distance. Press OK.

The full link method defines the distance between clusters as the largest distance between any two objects in different clusters (i.e. "most distant neighbors").

The proximity measure defined by the Euclidean distance is the geometric distance in n-dimensional space and is calculated as follows:

$$d(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}$$

The most important result of tree clustering is the hierarchical tree. Let's press the button Vertical dendrogram.

Tree diagrams may seem a little confusing at first, but after some study they become more understandable. The diagram starts at the top (for a vertical dendrogram) with each car in its own cluster.

As you start moving down, cars that are "closer to each other" join together and form clusters. Each node of the diagram above represents a union of two or more clusters, the position of the nodes on the vertical axis determines the distance at which the respective clusters were combined.

Step 2. Clustering using the K-means method

Based on the visual representation of the results, it can be assumed that the cars form four natural clusters. Let's check this assumption by dividing the initial data by the K means method into 4 clusters, and check the significance of the difference between the groups obtained.

In the launch panel of the cluster analysis module, choose K-means clustering.

Press the Variables button and choose All; in the Objects field, choose Observations (rows), and set the number of clusters to 4.

The K-means method works as follows: the calculations start from k randomly selected observations (in our case, k = 4), which become the centers of the groups; then the object composition of the clusters is changed so as to minimize variability within clusters and maximize variability between clusters.

Each subsequent observation (k+1) is assigned to the group whose center of gravity it is closest to.

After changing the composition of the cluster, a new center of gravity is calculated, most often as a vector of averages for each parameter. The algorithm continues until the composition of the clusters stops changing.
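The loop just described can be sketched in pure Python. For reproducibility the sketch seeds the centers with the first k observations, whereas the procedure described above picks them at random; the sample points are illustrative:

```python
def k_means(points, k, max_iter=100):
    """Lloyd's algorithm: assign each point to the nearest center, recompute
    centers as coordinate-wise means, stop when the composition of the
    clusters stops changing."""
    centers = [list(p) for p in points[:k]]  # deterministic start (see note above)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - c) ** 2 for a, c in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:  # composition stopped changing
            break
        labels = new_labels
        for j in range(k):  # new center of gravity = vector of means
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two well-separated groups of five points each.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.0),
       (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1), (10.2, 10.0)]
labels, centers = k_means(pts, 2)
```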

When the results of the classification are obtained, you can calculate the average value of the indicators for each cluster in order to assess how much they differ from each other.

In the K-means Results window, choose Analysis of Variance to determine the significance of the differences between the resulting clusters.

The value p < 0.05, which indicates a significant difference between the clusters.
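The analysis-of-variance comparison behind that p-value boils down to the one-way F statistic; a pure-Python sketch with illustrative groups:

```python
def f_statistic(groups):
    """One-way ANOVA: F = between-group mean square / within-group mean square.
    A large F (and hence a small p) indicates the group means differ
    significantly."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)    # between
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)  # within
    return (ssb / (k - 1)) / (ssw / (n - k))

# Two clearly separated groups give a very large F.
f = f_statistic([[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]])
```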

Press the Cluster Elements and Distances button to view the observations included in each cluster. The option also displays the Euclidean distances of objects from the centers (mean values) of their respective clusters.

First cluster:

Second cluster:

Third cluster:

Fourth cluster:

So, in each of the four clusters there are objects with a similar impact on the loss process.

Step 3. Descriptive statistics

Knowledge of descriptive statistics in each group is certainly important for any researcher.

Clustering tasks in Data Mining

Introduction to Cluster Analysis

From the entire vast field of application of cluster analysis, for example, the problem of socio-economic forecasting.

When analyzing and forecasting socio-economic phenomena, the researcher often encounters the multidimensionality of their description. This happens when solving the problem of market segmentation, building a typology of countries according to a sufficiently large number of indicators, forecasting the market situation for individual goods, studying and forecasting economic depression, and many other problems.

Multivariate analysis methods are the most effective quantitative tool for studying socio-economic processes described by a large number of characteristics. These include cluster analysis, taxonomy, pattern recognition, and factor analysis.

cluster analysis most clearly reflects the features of multivariate analysis in the classification, factor analysis - in the study of communication.

Sometimes the cluster analysis approach is referred to in the literature as numerical taxonomy, numerical classification, self-learning recognition, etc.

Cluster analysis found its first application in sociology. The name cluster analysis comes from the English word cluster - bunch, accumulation. For the first time in 1939, the subject of cluster analysis was defined and its description was made by the researcher Trion. The main purpose of cluster analysis is to divide the set of objects and features under study into groups or clusters that are homogeneous in the appropriate sense. This means that the problem of classifying data and identifying the corresponding structure in it is being solved. Cluster analysis methods can be applied in a variety of cases, even when it comes to a simple grouping, in which everything comes down to the formation of groups by quantitative similarity.

The great advantage of cluster analysis is that it allows splitting objects not by one parameter but by a whole set of features. In addition, cluster analysis, unlike most mathematical and statistical methods, does not impose any restrictions on the type of objects under consideration and allows us to consider a set of initial data of an almost arbitrary nature. This is of great importance, for example, for conjuncture forecasting, when indicators take a variety of forms that make it difficult to use traditional econometric approaches.

Cluster analysis makes it possible to consider a sufficiently large amount of information and drastically reduce, compress large arrays of socio-economic information, make them compact and visual.

Cluster analysis is of great importance in relation to sets of time series characterizing economic development (for example, general economic and commodity conditions). Here it is possible to single out the periods when the values ​​of the corresponding indicators were quite close, as well as to determine the groups of time series, the dynamics of which are most similar.

Cluster analysis can be used cyclically. In this case, the study is carried out until the desired results are achieved. At the same time, each cycle here can provide information that can greatly change the direction and approaches of further application of cluster analysis. This process can be represented as a feedback system.

In the tasks of socio-economic forecasting, it is very promising to combine cluster analysis with other quantitative methods (for example, with regression analysis).

Like any other method, cluster analysis has certain disadvantages and limitations. In particular, the composition and number of clusters depend on the chosen partitioning criteria. When the initial data array is reduced to a more compact form, certain distortions may occur, and the individual features of individual objects may be lost when they are replaced by the generalized values of the cluster parameters. When classifying objects, the possibility that the considered set contains no cluster structure at all is very often ignored.

In cluster analysis, it is considered that:

a) the selected characteristics allow, in principle, the desired clustering;

b) the units of measurement (scale) are chosen correctly.

The choice of scale plays a big role. Typically, data is normalized by subtracting the mean and dividing by the standard deviation so that the variance is equal to one.

1. The task of clustering

The task of clustering is, based on the data contained in the set X, to partition the set of objects G into m (m an integer) clusters (subsets) Q1, Q2, ..., Qm, so that each object Gj belongs to one and only one subset of the partition, objects belonging to the same cluster are similar, and objects belonging to different clusters are heterogeneous.

For example, let G include n countries, each characterized by GNP per capita (F1), the number of cars per 1,000 people (F2), per capita electricity consumption (F3), per capita steel consumption (F4), etc. Then X1 (the measurement vector) is the set of these characteristics for the first country, X2 for the second, X3 for the third, and so on. The task is to partition the countries by level of development.

The solution to the problem of cluster analysis is a partition that satisfies a certain optimality criterion. This criterion can be some functional expressing the desirability of various partitions and groupings, called the objective function. For example, the intragroup sum of squared deviations can be taken as the objective function:

W = Σj (xj − x̄)²,

where xj represents the measurements of the j-th object.

To solve the problem of cluster analysis, it is necessary to define the concept of similarity and heterogeneity.

It is clear that objects i and j would fall into one cluster when the distance between the points Xi and Xj is sufficiently small, and into different clusters when this distance is sufficiently large. Thus, whether objects land in one cluster or in different ones is determined by the concept of the distance between Xi and Xj in Ep, where Ep is the p-dimensional Euclidean space. A non-negative function d(Xi, Xj) is called a distance function (metric) if:

a) d(Xi, Xj) ≥ 0 for all Xi and Xj in Ep;

b) d(Xi, Xj) = 0 if and only if Xi = Xj;

c) d(Xi, Xj) = d(Xj, Xi);

d) d(Xi, Xj) ≤ d(Xi, Xk) + d(Xk, Xj), where Xj, Xi and Xk are any three vectors in Ep.

The value d(Xi, Xj) for given Xi and Xj is called the distance between Xi and Xj and is equivalent to the distance between Gi and Gj according to the selected characteristics (F1, F2, F3, ..., Fp).

The most commonly used distance functions are:

1. Euclidean distance: d2(Xi, Xj) = ( Σk (xik − xjk)² )^(1/2)

2. l1-norm: d1(Xi, Xj) = Σk |xik − xjk|

3. Supremum norm: d∞(Xi, Xj) = sup |xik − xjk|, k = 1, 2, ..., p

4. lp-norm: dp(Xi, Xj) = ( Σk |xik − xjk|^p )^(1/p)

The Euclidean metric is the most popular. The l1 metric is the easiest to calculate. The supremum norm is easy to calculate and includes an ordering procedure, while the lp-norm covers distance functions 1, 2 and 3.
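As an illustration, the four distance functions above can be sketched in pure Python (a minimal sketch; the function names are my own, not from the source):

```python
import math

def euclidean(x, y):
    # d2: square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_norm(x, y):
    # d1: sum of absolute coordinate differences (city-block)
    return sum(abs(a - b) for a, b in zip(x, y))

def sup_norm(x, y):
    # d_inf: the largest absolute coordinate difference
    return max(abs(a - b) for a, b in zip(x, y))

def lp_norm(x, y, p):
    # general l_p norm; p = 1 and p = 2 recover d1 and d2
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)
```

For the points (0, 0) and (3, 4), these give 5, 7, 4 and (for p = 2) 5, which matches the claim that the lp-norm covers the other distances.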

Let the n measurements X1, X2, ..., Xn be presented as a data matrix of size p × n. Then the distances between pairs of vectors d(Xi, Xj) can be represented as a symmetric n × n distance matrix D = (d(Xi, Xj)).

The concept opposite to distance is the similarity between objects Gi and Gj. A non-negative real function S(Xi, Xj) = Sij is called a similarity measure if:

1) 0 ≤ S(Xi, Xj) < 1 for Xi ≠ Xj;

2) S(Xi, Xi) = 1;

3) S(Xi, Xj) = S(Xj, Xi).

Pairs of similarity values can be combined into a similarity matrix; the value Sij is called the coefficient of similarity.

2. Clustering methods

Today there are many methods of cluster analysis. Let us dwell on some of them (the methods given below are usually called the methods of minimum variance).

Let X be the observation matrix, X = (X1, X2, ..., Xn), and let the square of the Euclidean distance between Xi and Xj be determined by the formula:

d²(Xi, Xj) = Σk (xik − xjk)²

1) Full connection method.

The essence of this method is that any two objects belonging to the same group (cluster) must have a similarity coefficient no less than a certain threshold value S. In terms of the Euclidean distance d, this means that the distance between any two points (objects) of the cluster must not exceed some threshold value h. Thus, h defines the maximum allowable diameter of a subset forming a cluster.

2) Maximum local distance method.

Each object is considered as a one-point cluster. Objects are grouped according to the following rule: two clusters are combined if the maximum distance between the points of one cluster and the points of another is minimal. The procedure consists of n - 1 steps and results in partitions that match all possible partitions in the previous method for any thresholds.

3) Ward's method.

In this method, the intragroup sum of squared deviations is used as an objective function, which is nothing more than the sum of the squared distances between each point (object) and the average for the cluster containing this object. At each step, two clusters are combined that lead to the minimum increase in the objective function, i.e. intragroup sum of squares. This method is aimed at combining closely spaced clusters.
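Ward's merge criterion can be sketched directly from this description: compute the increase in the within-cluster sum of squares that a candidate merge would cause, and pick the merge with the smallest increase (a minimal sketch; `sse` and `ward_increase` are names of my own choosing):

```python
def sse(cluster):
    # within-cluster sum of squared deviations from the centroid
    n = len(cluster)
    dim = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / n for d in range(dim)]
    return sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
               for p in cluster)

def ward_increase(c1, c2):
    # increase in the objective function (intragroup sum of squares)
    # that merging clusters c1 and c2 would produce
    return sse(c1 + c2) - sse(c1) - sse(c2)
```

For two tight, well-separated clusters the increase is large, so the method prefers to merge closely spaced clusters, exactly as the text states.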

4) Centroid method.

The distance between two clusters is defined as the Euclidean distance between the centres (means) of these clusters:

d²ij = (X̄ − Ȳ)ᵀ(X̄ − Ȳ),

where X̄ and Ȳ are the centre vectors of the two clusters. Clustering proceeds in stages: at each of the n − 1 steps, the two clusters with the minimum value of d²ij are united. If n1 is much greater than n2, the centre of the merged cluster is close to the centre of the first cluster, and the characteristics of the second cluster are practically ignored when the clusters are merged. This method is sometimes also called the method of weighted groups.
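The centroid distance above is easy to compute: average each cluster's points component-wise, then take the squared Euclidean distance between the two averages (a small sketch; function names are my own):

```python
def centroid(cluster):
    # component-wise mean of the cluster's points
    n = len(cluster)
    return [sum(p[d] for p in cluster) / n for d in range(len(cluster[0]))]

def centroid_dist_sq(c1, c2):
    # squared Euclidean distance between the two cluster centres:
    # (X - Y)^T (X - Y) for centre vectors X and Y
    a, b = centroid(c1), centroid(c2)
    return sum((x - y) ** 2 for x, y in zip(a, b))
```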

3. Sequential clustering algorithm

Consider Ι = (Ι1, Ι2, ..., Ιn) as n single-element clusters (Ι1), (Ι2), ..., (Ιn). Choose two of them, say Ιi and Ιj, that are in some sense closer to each other, and combine them into one cluster. The new set, now consisting of n − 1 clusters, will be:

(Ι1), (Ι2), ..., (Ιi, Ιj), ..., (Ιn).

Repeating the process, we obtain successive sets consisting of n − 2, n − 3, n − 4, etc. clusters. At the end of the procedure we obtain a single cluster consisting of all n objects and coinciding with the original set Ι = (Ι1, Ι2, ..., Ιn).

As a measure of distance we take the square of the Euclidean metric dij² and compute the matrix D = (dij²), where dij² is the squared distance between Ιi and Ιj. The matrix D is symmetric with zeros on the diagonal; its upper triangle contains the entries d12², d13², ..., d1n², d23², ..., d2n², ..., d(n−1)n².

Let the distance between Ιi and Ιj be minimal: dij² = min(dij², i ≠ j). From Ιi and Ιj we form a new cluster (Ιi, Ιj) and build a new (n − 1) × (n − 1) distance matrix, in which the row and column for (Ιi, Ιj) contain the distances d(ij),k² from the new cluster to each remaining cluster, while the other rows and columns are taken from the previous matrix.

Thus n − 2 rows of the new matrix are taken from the previous one, and only the first row is recomputed. Computations can be kept to a minimum if d(ij),k², k = 1, 2, ..., n (k ≠ i, k ≠ j), can be expressed through the elements of the original matrix.

Initially, the distance was determined only between single-element clusters, but it is also necessary to determine the distances between clusters containing more than one element. This can be done in various ways, and depending on the chosen method, we get cluster analysis algorithms with different properties. One can, for example, put the distance between the cluster i + j and some other cluster k, equal to the arithmetic mean of the distances between clusters i and k and clusters j and k:

d i+j,k = ½ (d i k + d j k).

But one can also define d i+j,k as the minimum of these two distances:

d i+j,k = min(d i k , d j k).

Thus, the first step of the agglomerative hierarchical algorithm operation is described. The next steps are the same.

A fairly wide class of algorithms can be obtained if the following general formula is used to recalculate distances:

d i+j,k = A(w)·min(d ik , d jk) + B(w)·max(d ik , d jk),

where the coefficients A(w) and B(w) take different values depending on whether d ik ≤ d jk or d ik > d jk; n i and n j are the numbers of elements in clusters i and j, and w is a free parameter whose choice determines a particular algorithm. For example, when w = 1 we obtain the so-called "average linkage" algorithm, for which the formula for recalculating distances takes the form:

d i+j,k = (n i·d ik + n j·d jk) / (n i + n j)

In this case, the distance between two clusters at each step of the algorithm turns out to be equal to the arithmetic mean of the distances between all pairs of elements such that one element of the pair belongs to one cluster, the other to another.

The visual meaning of the parameter w becomes clear if we let w → ∞. The distance recalculation formula then takes the form:

d i+j,k = min(d ik , d jk)

This will be the so-called “nearest neighbor” algorithm, which makes it possible to select clusters of an arbitrarily complex shape, provided that different parts of such clusters are connected by chains of elements close to each other. In this case, the distance between two clusters at each step of the algorithm turns out to be equal to the distance between the two closest elements belonging to these two clusters.
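The "nearest neighbour" (single linkage) rule just described can be sketched as a small agglomerative procedure: repeatedly merge the two clusters whose closest elements are nearest to each other (a hedged sketch; `single_linkage` and its signature are my own, not from the source):

```python
def single_linkage(dist, n_clusters):
    # dist: symmetric matrix of pairwise distances between single objects;
    # merge the two closest clusters until n_clusters remain
    clusters = [[i] for i in range(len(dist))]

    def d(a, b):
        # nearest-neighbour rule: cluster distance is the distance
        # between the two closest elements of the two clusters
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        a, b = min(pairs, key=lambda ab: d(ab[0], ab[1]))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters
```

Because only the closest pair of elements matters, chains of mutually close points are absorbed into one cluster, which is why this rule can pick out clusters of arbitrarily complex shape.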

Quite often it is assumed that the initial distances (differences) between the grouped elements are given. In some cases, this is true. However, only objects and their characteristics are specified, and the distance matrix is ​​built based on these data. Depending on whether distances between objects or between characteristics of objects are calculated, different methods are used.

In the case of cluster analysis of objects, the most common measure of difference is either the square of the Euclidean distance

dij² = Σh (xih − xjh)²

(where xih and xjh are the values of the h-th feature for the i-th and j-th objects, and m is the number of features, h = 1, ..., m), or the Euclidean distance itself. If features are assigned different weights wh, these weights can be taken into account when calculating the distance:

dij² = Σh wh (xih − xjh)²

Sometimes the distance calculated by the formula

dij = Σh |xih − xjh|

is used as a measure of difference; it is called the "Hamming", "Manhattan" or "city-block" distance.

A natural measure of the similarity of the characteristics of objects in many problems is the correlation coefficient between them:

rij = Σh (xhi − mi)(xhj − mj) / (n·di·dj),

where mi, mj, di, dj are, respectively, the means and standard deviations of characteristics i and j. A measure of the difference between characteristics can be the value 1 − rij. In some problems, the sign of the correlation coefficient is insignificant and depends only on the choice of the unit of measure. In this case, 1 − |rij| is used as the measure of the difference between the characteristics.
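The correlation-based measures above can be sketched as follows (a minimal sketch; the function names are my own):

```python
import math

def corr(xs, ys):
    # Pearson correlation between two feature columns
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

def dissimilarity(xs, ys):
    # sign-insensitive difference measure between two characteristics:
    # 1 - |r| is 0 for perfectly (anti)correlated features
    return 1 - abs(corr(xs, ys))
```

Note that a feature and its negation get dissimilarity 0 under this measure, which is exactly the sign-insensitivity the text describes.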

4. Number of clusters

A highly important issue is the choice of the required number of clusters. Sometimes the number of clusters m can be chosen a priori. However, in the general case it is determined in the process of partitioning the set into clusters.

Studies by Fortier and Solomon found that the number of clusters must be chosen so as to achieve a probability α of finding the best partition. Thus, the optimal number of partitions is a function of the given fraction β of the best (or, in some sense, admissible) partitions in the set of all possible ones; the total scattering is the greater, the higher the fraction β of allowable splits. Fortier and Solomon developed a table from which one can find the needed number of partitions S(α, β), depending on α and β (where α is the probability that the best partition is found, and β is the share of the best partitions in the total number of partitions). Moreover, as a measure of heterogeneity, not the scattering measure is used but the membership measure introduced by Holzinger and Harman. The table of values S(α, β) is given below.

Table of values S(α, β)

β \ α     0.20    0.10    0.05    0.01   0.001  0.0001
0.20         8      11      14      21      31      42
0.10        16      22      29      44      66      88
0.05        32      45      59      90     135     180
0.01       161     230     299     459     689     918
0.001     1626    2326    3026    4652    6977    9303
0.0001   17475   25000   32526   55000   75000  100000

Quite often, the criterion for merging (and hence the number of clusters) is the change in the corresponding function, for example, the sum of squared deviations:

E = Σj Σ (xi − x̄j)², where the inner sum runs over the objects of cluster j.

The grouping process must correspond here to a sequentially minimal increase in the value of the criterion E. A sharp jump in the value of E can be interpreted as a characteristic of the number of clusters that objectively exist in the population under study.

So the second way to determine the best number of clusters reduces to identifying jumps, determined by the phase transition from a strongly coupled to a weakly coupled state of the objects.
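The jump heuristic can be sketched as follows, assuming for illustration that the criterion E is recorded after every merge step (`jump_step` is a name of my own choosing):

```python
def jump_step(e_values):
    # e_values[t] = value of the criterion E after merge step t
    # (E is non-decreasing as clusters are merged); the sharpest
    # increase marks the "objectively existing" number of clusters
    deltas = [b - a for a, b in zip(e_values, e_values[1:])]
    return deltas.index(max(deltas)) + 1
```

For example, for the sequence of criterion values [1, 2, 3, 10, 11] the sharpest jump occurs at step 3, suggesting that the merge at that step joined two clusters that should have stayed apart.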

5. Dendrograms

The best-known method of representing a distance or similarity matrix is based on the idea of a dendrogram, or tree diagram. A dendrogram can be defined as a graphical representation of the results of the sequential clustering process, carried out in terms of the distance matrix. With the help of a dendrogram it is possible to depict the clustering procedure graphically or geometrically, provided that the procedure operates only with elements of the distance or similarity matrix.

There are many ways to construct dendrograms. In a dendrogram, the objects are located vertically on the left and the clustering results on the right. Distance or similarity values corresponding to the formation of new clusters are displayed along a horizontal line above the dendrogram.

Fig. 1

Figure 1 shows an example of a dendrogram. It corresponds to the case of six objects (n = 6) and k characteristics (features). Objects A and C are the closest and are therefore combined into one cluster at a proximity level of 0.9. Objects D and E are combined at the level 0.8. Now there are 4 clusters:

(A, C), (F), (D, E), (B).

Further, the clusters (A, C, F) and (E, D, B) are formed, corresponding to proximity levels of 0.7 and 0.6. Finally, all objects are grouped into one cluster at the level 0.5.

The type of dendrogram depends on the choice of the measure of similarity or distance between an object and a cluster and on the clustering method. The most important point is the choice of this measure of similarity or distance.

The number of cluster analysis algorithms is very large. All of them can be divided into hierarchical and non-hierarchical.

Hierarchical algorithms are associated with the construction of dendrograms and are divided into:

a) agglomerative, characterized by a consistent combination of initial elements and a corresponding decrease in the number of clusters;

b) divisive, in which the number of clusters increases, starting from one, so that a sequence of splitting groups is formed.

Cluster analysis algorithms today have good software implementations that allow solving problems of very high dimension.

6. Data

Cluster analysis can be applied to interval data, frequencies, and binary data. It is important that the variables vary on comparable scales.

Heterogeneity of units of measurement, and the resulting impossibility of reasonably expressing the values of various indicators on one scale, leads to the fact that the distances between points, which reflect the position of objects in the space of their properties, depend on an arbitrarily chosen scale. To eliminate this heterogeneity of measurement, all initial data values are first normalized, i.e. expressed through the ratio of these values to a certain quantity reflecting particular properties of the given indicator. Normalization of initial data for cluster analysis is sometimes carried out by dividing the initial values by the standard deviation of the corresponding indicators. Another way is to calculate the so-called standardized contribution, also known as the Z-score.

The Z-score shows how many standard deviations a given observation lies from the mean:

z = (xi − x̄) / S,

where xi is the value of the observation, x̄ is the mean, and S is the standard deviation.

The mean of the Z-scores is zero and their standard deviation is 1.

Standardization allows comparison of observations from different distributions. If the distribution of a variable is normal (or close to normal) and the mean and variance are known or estimated from large samples, then the Z-score of an observation provides more specific information about its location.
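The standardization just described can be sketched in a few lines (a minimal sketch; `z_scores` is a name of my own choosing, and the population standard deviation is used for illustration):

```python
import math

def z_scores(values):
    # Z-score: how many standard deviations an observation lies from the mean
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]
```

After the transformation the scores have mean zero and standard deviation 1, so features measured in different units become comparable.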

Note that normalization methods mean the recognition of all features as equivalent from the point of view of elucidating the similarity of the objects under consideration. It has already been noted that in relation to the economy, the recognition of the equivalence of various indicators does not always seem justified. It would be desirable, along with normalization, to give each of the indicators a weight that reflects its significance in the course of establishing similarities and differences between objects.

In this situation, one has to resort to a method of determining the weights of individual indicators: a survey of experts. For example, when solving the problem of classifying countries by level of economic development, the results of a survey of 40 leading Moscow experts on the problems of developed countries were used, on a ten-point scale:

generalized indicators of socio-economic development - 9 points;

indicators of sectoral distribution of the employed population - 7 points;

indicators of the prevalence of hired labor - 6 points;

indicators characterizing the human element of the productive forces - 6 points;

indicators of the development of material productive forces - 8 points;

indicator of public spending - 4 points;

"military-economic" indicators - 3 points;

socio-demographic indicators - 4 points.

The experts' estimates were relatively stable.

Expert assessments provide a well-known basis for determining the importance of indicators included in a particular group of indicators. Multiplying the normalized values ​​of indicators by a coefficient corresponding to the average score of the assessment makes it possible to calculate the distances between points that reflect the position of countries in a multidimensional space, taking into account the unequal weight of their features.

Quite often, when solving such problems, not one, but two calculations are used: the first, in which all signs are considered equivalent, the second, where they are given different weights in accordance with the average values ​​of expert estimates.
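The second calculation (normalize, then scale each indicator by its expert weight) can be sketched as follows (a hedged sketch; the function name and data layout are my own, and the average expert scores are applied directly as multiplicative weights, as the text suggests):

```python
import math

def weighted_features(rows, weights):
    # rows: objects as lists of raw indicator values (one column per indicator);
    # normalize each column to Z-scores, then scale by its expert weight
    cols = list(zip(*rows))
    out_cols = []
    for col, w in zip(cols, weights):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        out_cols.append([w * (v - mean) / sd for v in col])
    return [list(r) for r in zip(*out_cols)]
```

Distances computed on the transformed rows then reflect the unequal importance of the features, as established by the expert survey.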

7. Application of cluster analysis

Let's consider some applications of cluster analysis.

1. The division of countries into groups according to the level of development.

65 countries were studied according to 31 indicators (national income per capita, the share of the population employed in industry in %, savings per capita, the share of the population employed in agriculture in %, average life expectancy, number of cars per 1 thousand inhabitants, number of armed forces per 1 million inhabitants, share of industrial GDP in %, share of agricultural GDP in %, etc.)

Each of the countries acts in this consideration as an object characterized by certain values ​​of 31 indicators. Accordingly, they can be represented as points in a 31-dimensional space. Such a space is usually called the property space of the objects under study. Comparison of the distance between these points will reflect the degree of proximity of the countries under consideration, their similarity to each other. The socio-economic meaning of this understanding of similarity means that countries are considered the more similar, the smaller the differences between the same indicators with which they are described.

The first step of such an analysis is to identify the pair of national economies included in the similarity matrix, the distance between which is the smallest. These will obviously be the most similar, similar economies. In the following consideration, both of these countries are considered a single group, a single cluster. Accordingly, the original matrix is ​​transformed so that its elements are the distances between all possible pairs of not 65, but 64 objects - 63 economies and a newly transformed cluster - a conditional union of the two most similar countries. Rows and columns corresponding to the distances from a pair of countries included in the union to all the others are discarded from the original similarity matrix, but a row and column are added containing the distance between the cluster obtained by the union and other countries.

The distance between the newly obtained cluster and the countries is assumed to be equal to the average of the distances between the latter and the two countries that make up the new cluster. In other words, the combined group of countries is considered as a whole with characteristics approximately equal to the average of the characteristics of its constituent countries.

The second step of the analysis is to consider a matrix transformed in this way with 64 rows and columns. Again, a pair of economies is identified, the distance between which is of the least importance, and they, just as in the first case, are brought together. In this case, the smallest distance can be both between a pair of countries, and between any country and the union of countries obtained at the previous stage.

Further procedures are similar to those described above: at each stage, the matrix is transformed so that the two columns and two rows containing the distances to the objects (pairs of countries or cluster associations) brought together at the previous stage are excluded from it; the excluded rows and columns are replaced by a column and a row containing the distances from the new association to the rest of the objects; then, in the modified matrix, the pair of closest objects is identified. The analysis continues until the matrix is completely exhausted (i.e., until all countries are brought together).

The generalized results of the matrix analysis can be represented in the form of a similarity tree (dendrogram), similar to that described above, with the only difference that the similarity tree reflecting the relative proximity of all 65 countries under consideration is much more complicated than a scheme in which only a few national economies appear. This tree, according to the number of matched objects, includes 65 levels. The first (lowest) level contains points corresponding to each country separately. The connection of two such points at the second level shows the pair of countries that are closest in terms of the general type of their national economies. At the third level, the next most similar pairwise relation is noted (as already mentioned, this relation can hold either between a new pair of countries or between a new country and an already identified pair of similar countries). And so on up to the last level, at which all the studied countries act as a single set.

As a result of applying cluster analysis, the following five groups of countries were obtained:

Afro-Asian group;

Latin-Asian group;

Latin-Mediterranean group;

group of developed capitalist countries (without the USA);

USA.

The introduction of new indicators beyond the 31 indicators used here, or their replacement by others, naturally leads to a change in the results of the country classification.

2. The division of countries according to the criterion of proximity of culture.

As you know, marketing must take into account the culture of countries (customs, traditions, etc.).

The following groups of countries were obtained through clustering:

· Arabic;

· Middle Eastern;

· Scandinavian;

· German-speaking;

· English-speaking;

· Romance European;

· Latin American;

· Far Eastern.

3. Development of a zinc market forecast.

Cluster analysis plays an important role at the stage of reduction of the economic and mathematical model of commodity conjuncture, contributing to the facilitation and simplification of computational procedures, ensuring greater compactness of the results obtained while maintaining the required accuracy. The use of cluster analysis makes it possible to divide the entire initial set of market indicators into groups (clusters) according to the relevant criteria, thereby facilitating the selection of the most representative indicators.

Cluster analysis is widely used to model market conditions. In practice, the majority of forecasting tasks are based on the use of cluster analysis.

Consider, for example, the task of developing a forecast for the zinc market.

Initially, 30 key indicators of the global zinc market were selected:

X 1 - time

Production figures:

X 2 - in the world

X 4 - Europe

X 5 - Canada

X 6 - Japan

X 7 - Australia

Consumption indicators:

X 8 - in the world

X 10 - Europe

X 11 - Canada

X 12 - Japan

X 13 - Australia

Producer stocks of zinc:

X 14 - in the world

X 16 - Europe

X 17 - other countries

Consumer stocks of zinc:

X 18 - in the USA

X 19 - in England

X 20 - in Japan

Import of zinc ores and concentrates (thousand tons)

X 21 - in the USA

X 22 - in Japan

X 23 - in Germany

Export of zinc ores and concentrates (thousand tons)

X 24 - from Canada

X 25 - from Australia

Import of zinc (thousand tons)

X 26 - in the USA

X 27 - to England

X 28 - in Germany

Export of zinc (thousand tons)

X 29 - from Canada

X 30 - from Australia

To determine specific dependencies, the apparatus of correlation and regression analysis was used. Relationships were analyzed on the basis of a matrix of paired correlation coefficients. Here, the hypothesis of the normal distribution of the analyzed indicators of the conjuncture was accepted. It is clear that r ij are not the only possible indicator of the relationship between the indicators used. The need to use cluster analysis in this problem is due to the fact that the number of indicators affecting the price of zinc is very large. There is a need to reduce them for a number of the following reasons:

a) lack of complete statistical data for all variables;

b) a sharp complication of computational procedures when a large number of variables are introduced into the model;

c) the optimal use of regression analysis methods requires the excess of the number of observed values ​​over the number of variables by at least 6-8 times;

d) the desire to use statistically independent variables in the model, etc.

It is very difficult to carry out such an analysis directly on a relatively bulky matrix of correlation coefficients. With the help of cluster analysis, the entire set of market variables can be divided into groups in such a way that the elements of each cluster are strongly correlated with each other, and representatives of different groups are characterized by a weak correlation.
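As an illustration of this idea, variables can be grouped so that each joins a group with whose seed variable it is strongly correlated (a simplified greedy sketch of my own, not the thesis's agglomerative algorithm; names and the threshold are assumptions):

```python
def group_by_correlation(names, r, threshold=0.8):
    # names: variable labels; r: matrix of pair correlation coefficients;
    # greedily assign each variable to the first group whose seed
    # variable it is strongly correlated with (|r| above the threshold)
    groups = []
    for i in range(len(names)):
        for g in groups:
            if abs(r[i][g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return [[names[i] for i in g] for g in groups]
```

Within each resulting group the variables are strongly correlated with the group's seed, so one representative per group can stand in for the rest when building the regression model.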

To solve this problem, one of the agglomerative hierarchical cluster analysis algorithms was applied. At each step, the number of clusters is reduced by one due to the optimal, in a certain sense, union of two groups. The criterion for joining is to change the corresponding function. As a function of this, the values ​​of the sums of squared deviations calculated by the following formulas were used:

(j = 1, 2, …,m ),

where j is the cluster number, n is the number of elements in the cluster, and r_ij is the coefficient of pair correlation.

Thus, the grouping process must correspond to a sequential minimum increase in the value of the criterion E.

At the first stage, the initial data array is presented as a set of clusters containing one element each. The grouping process begins with the union of the pair of clusters that leads to the minimum increase in the sum of squared deviations; this requires estimating that increase for each possible union. At the next stage, the sums of squared deviations are considered for the resulting clusters, and so on. The process is stopped at some step; to decide where, the value of the sum of squared deviations is monitored. Considering the sequence of increasing values, one can catch a jump (one or more) in its dynamics, which can be interpreted as a characteristic of the number of groups "objectively" existing in the population under study. In the above example, jumps took place when the number of clusters was 7 and 5. The number of groups should not be reduced further, because this degrades the quality of the model. After obtaining the clusters, the variables most important in the economic sense and most closely related to the chosen market criterion (in this case, London Metal Exchange zinc quotations) were selected. This approach allows a significant part of the information contained in the original set of conjuncture indicators to be preserved.

Many of us have heard the phrase "cluster analysis", but not everyone understands what it means. Besides, it sounds more than mysterious! In fact, this is just the name of a method for dividing a data sample into categories of elements according to certain criteria. For example, cluster analysis allows you to divide people into groups with high, medium and low self-esteem. Simply put, a cluster is a type of objects that are similar in a certain way.

Cluster analysis: problems in use

When deciding to use this method in your research, it must be remembered that the clusters it selects can be unstable. Therefore, as in the case of factor analysis, you need to check the results on another group of objects or, after a certain period of time, calculate the measurement error. Moreover, it is best to use cluster analysis on large samples selected by randomization or stratification, because this is the only way to draw a scientific conclusion using induction. It performs best at testing hypotheses rather than creating them from scratch.

Hierarchical cluster analysis

If you need to classify random elements quickly, you can start by treating each of them as a separate cluster. This is the essence of one of the easiest-to-understand types of cluster analysis. Using it, at the second stage the researcher forms pairs of elements that are similar in the desired feature and then joins them together the required number of times. The clusters located at the minimum distance from each other are determined by an iterative procedure. It is repeated until one of the following criteria is met:

  • obtaining a pre-planned number of clusters;
  • each of the clusters contains the required number of elements;
  • each group has the necessary ratio of heterogeneity and homogeneity within it.

In order to correctly calculate the distance between clusters, the following methods are often used:

  • single and complete linkage;
  • King's average linkage;
  • the centroid method;
  • taking group averages.

To evaluate the results of clustering, the following criteria are used:

  • clarity index;
  • split ratio;
  • ordinary, normalized and modified entropy;
  • second and third Rubens functional.

Cluster analysis methods

Most often, when analyzing a sample of objects, the minimum distance method is used: elements whose similarity coefficient is greater than a threshold value are combined into a cluster. When the maximum local distance method is used, two clusters are merged when the maximum distance between the points of one and the points of the other is minimal. The centroid clustering method involves calculating the distances between the mean values of the indicators in the groups. And Ward's method is most rationally used for grouping clusters that are close in the studied parameter.