TRENDIFY SEGMENTATION SOLUTIONS

V. What are segmentation algorithms?

  • Segmentation Algorithms
  • What are the factors that are important in creating a segment? 

V. What Are Segmentation Algorithms?

In this article, we will cover several clustering algorithms. After explaining how each algorithm works, we will demonstrate it with an example.

K-means 

The k-means algorithm is a centroid-based clustering algorithm. Its main purpose is to divide the data into k optimal clusters by minimizing the sum of squared distances of each point to its cluster center. One of the difficulties here is choosing the cluster centers for the first iteration. The k-means++ algorithm is preferred for determining the initial centers: it selects the first cluster centers intelligently so that k-means converges faster.

Example iterations of the k-means algorithm are shown in the picture below. In the first iteration, the cluster centers were chosen with the k-means++ algorithm. In the second iteration, each point was assigned to its nearest cluster center, and new centers were recalculated from the assigned points. Each subsequent iteration continued in this way until the iteration limit was reached, at which point the final clusters were fixed.

Now let’s implement the k-means algorithm with Python.
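The snippet itself appears as an image in the original article, so the code below is only a minimal sketch based on the description that follows; the number of blobs, the cluster count and the plotting details are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Create a small synthetic data set with four well-separated blobs
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Fit k-means; init="k-means++" chooses the initial centers intelligently
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the clusters and the final cluster centers
plt.scatter(X[:, 0], X[:, 1], c=labels, s=15, cmap="viridis")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=120, label="cluster centers")
plt.legend()
plt.show()
```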

In the above snippet, we first used the make_blobs method in the sklearn library to create sample data. Then we fit the KMeans algorithm to divide the generated data into clusters. With the help of the matplotlib library, we plotted the resulting clusters and their centers. Because the data is low-dimensional, the cluster elements are visually separable and are easily separated by k-means.

DBSCAN

DBSCAN is a density-based clustering algorithm. Its working logic is to assign points to the same cluster when a sufficient number of them fall within a given area. Since everything is determined by density, there is no need to specify the number of clusters before fitting, as there is with k-means.

The DBSCAN algorithm takes two important parameters. One is epsilon (ε) and the other is min_samples. The epsilon parameter is the radius of the neighbourhood around a point within which other points count as neighbours. The min_samples parameter, as the name suggests, is the minimum number of points required to form a dense region, i.e. a cluster. Thanks to min_samples, small groups of points that are close to each other but fewer than the minimum are not turned into clusters of their own. Points that remain unassigned after clustering are treated as outliers.

The picture above shows how the DBSCAN and k-means algorithms behave on the same data sets. Clustering with DBSCAN is observed to be better on data sets like these, because the data is concentrated in dense regions. That is why it is not always possible to say that one algorithm is simply better than the other. If a data set consists of regions of similar density gathered together, as above, using DBSCAN may be the sensible option.

Now let’s implement the DBSCAN algorithm with Python.
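Again, the original snippet is an image; the sketch below assumes a non-convex, dense data set generated with make_moons, and the eps and min_samples values are illustrative only.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: dense, non-convex clusters
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# eps is the neighbourhood radius, min_samples the density threshold
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)  # points labelled -1 are treated as outliers

plt.scatter(X[:, 0], X[:, 1], c=labels, s=15, cmap="viridis")
plt.title("DBSCAN clustering")
plt.show()
```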

As seen in the plot above, the clusters in the dense areas were successfully detected by the DBSCAN clustering algorithm.

Gaussian Mixture Models (GMM) 

Many data sets can be modeled with the Gaussian distribution, so it is quite common to assume that the clusters in a data set come from different Gaussian distributions. In other words, a GMM describes the data set as a mixture of k Gaussian distributions under an assumption of normality. Given that assumption, the Gaussian parameters of these k sources are optimized to maximize the likelihood of the mixture, using the Expectation-Maximization (EM) algorithm. As a result, GMM can succeed where algorithms that model the data as if it came from a single distribution get stuck. The probability-based approach gives, for each sample in the data set, the probability of belonging to every cluster. Unlike commonly used clustering algorithms such as k-means, it does not force clusters to be circular; GMM clusters tend toward ellipses.

Image 5: Example of K-means and GMM clusters 

Now let’s implement the GMM algorithm with Python.
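The original code is not reproduced in the text; based on the description that follows (the wine dataset plus standard scaling), a sketch might look like this. The choice of three components and a full covariance matrix is an assumption.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

# Load the wine data set and standardise the features
X = load_wine().data
X_scaled = StandardScaler().fit_transform(X)

# Fit a mixture of three Gaussians (the wine data has three underlying classes)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
labels = gmm.fit_predict(X_scaled)

# predict_proba gives, for every sample, the probability of belonging to each cluster
probabilities = gmm.predict_proba(X_scaled)
print(probabilities[:5].round(3))
```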

We used the well-known wine dataset for clustering with the GMM algorithm. Before fitting, we applied standard scaling to the data.

Agglomerative clustering 

Agglomerative clustering is a hierarchical clustering algorithm. In agglomerative (bottom-up) clustering, all objects start out separate from each other: each data point is initially treated as its own cluster. Then the clusters with the most similar attributes are merged step by step, and if the process is run to the end, all the data ends up in a single cluster. One advantage of this algorithm is that it is not necessary to specify the number of clusters in advance. In addition, the trees called dendrograms produced by the algorithm help us understand the data and where to split it.

One of the most important parameters of agglomerative clustering is the linkage parameter. Some of the linkage types are single, complete and average linkage. Single linkage merges the two clusters whose closest members are nearest to each other, using the minimum distance in the distance matrix. Complete linkage merges clusters based on the greatest distance between their members. Average linkage, on the other hand, merges clusters according to the average of the distances between the points in the two clusters.

Now let’s implement the agglomerative clustering algorithm with Python.
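Here too the original snippet is an image, so the code below is only a sketch: it assumes a synthetic blob data set and average linkage, and uses scipy to draw the dendrogram mentioned above.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Agglomerative clustering with average linkage
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agg.fit_predict(X)

# Dendrogram of the same data (built with scipy) to inspect the merge order
plt.figure(figsize=(8, 4))
dendrogram(linkage(X, method="average"))
plt.title("Dendrogram (average linkage)")
plt.show()
```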

The picture below shows the results of different clustering algorithms on several data sets. For each data set, every algorithm was tuned to its most suitable parameters. These visual results can help us form an idea of the algorithms and choose between them according to the type of data. Although this picture is useful as a first impression, we should be more careful with our own data.

What are the factors that are important in creating a segment? 

There are many things to consider when preparing your data for product segmentation with machine learning. These include:

Filling missing values in the data 

First of all, it is necessary to understand correctly what the missing data is. Not every empty value is truly missing data. Some values are inherently non-existent, and filling them in makes the result meaningless. Other values are missing only because the data was not collected, and filling them in by looking at the rest of the data helps to make sense of the result. These two cases call for different approaches and methods.

Let’s say you are doing a product segmentation and one of the variables is Old Season Sales. We cannot expect this variable to be filled in for products that were not in the product list in previous seasons, because the product simply did not exist in those seasons. If we treat this as missing data and fill it in from the quantitative values of the other data (mean, median, mode, KNN Imputer), we will produce a biased result. In this case, data labeling/separation techniques can be used to label the missing data and keep it in a separate group from the rest. With this technique, the aim is to keep the missing data as far apart as possible from the existing values.

Now suppose the same product segmentation contains products that did appear in past seasons but still have an empty Old Season Sales value. In this case the gaps exist because the data was not collected, and filling them in from the other, complete records helps to make sense of the result. The filling method can be chosen according to the type of data and business knowledge. There are many ways to handle missing data, some of which are:

  • Mean 
  • Mode 
  • Median 
  • KNN Imputer 

As seen in the example above, some values in the columns are missing. One of the methods listed can be used to complete them. If the data consists of numerical values, the mean or median can be preferred; if it consists of categorical values, the mode can be preferred. There are also many other options for filling in missing data, such as model-based imputation or the KNN Imputer method, which fills in a missing value by looking at the k nearest neighbors.
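As a rough illustration of these options, the sketch below fills a hypothetical product table with sklearn's SimpleImputer and KNNImputer; the column names and values are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical product table with gaps in a numerical and a categorical column
df = pd.DataFrame({
    "old_season_sales": [120, np.nan, 80, 95, np.nan],
    "category": ["shoe", "shirt", np.nan, "shirt", "shoe"],
})

# Numerical column: fill with the median
df["old_season_sales"] = SimpleImputer(strategy="median").fit_transform(
    df[["old_season_sales"]]).ravel()

# Categorical column: fill with the mode (most frequent value)
df["category"] = SimpleImputer(strategy="most_frequent").fit_transform(
    df[["category"]]).ravel()

# KNN Imputer (numerical data only): fill a gap from the k nearest rows
numeric = pd.DataFrame({"price": [10, 12, np.nan, 11, 50],
                        "sales": [100, 110, 105, np.nan, 500]})
print(KNNImputer(n_neighbors=2).fit_transform(numeric))
```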

Detection of outlier values 

Outliers are points in a data set that deviate, in either direction, from most of the observations and stand apart from the rest of the data. Outliers can be informative and valuable in many cases. At the same time, they can distort the accuracy of a model, introduce errors, and increase the complexity of statistical calculations.

Various factors lead to outliers in a data set, chiefly manual errors and experimental errors. Manual error is one of the most common types in large datasets: the data fed into the system is very large, and human-entered data of this kind is prone to entry mistakes. Experimental errors in the extraction, processing, and final preparation of the dataset can also cause outliers. One of the most common methods for finding outliers is the z-score.

Z Score 

The z-score measures how far a data point lies from the mean of the data set, expressed in standard deviations: z = (x − μ) / σ. In effect, the data is rescaled so that the mean becomes 0 and the standard deviation becomes 1, using the mean and standard deviation calculated from the given dataset.

If a data point’s z-score is greater than 3, the point lies more than 3 standard deviations from the mean and is therefore quite different from the other data points. When eliminating outliers, iterative elimination is very helpful: as long as the outliers remain in the data, the mean and standard deviation computed from it will be biased. The procedure is:

  1. Look at the largest and smallest values in the data. 
  2. Determine the mean and standard deviation. 
  3. Mark points with X < μ – 3σ or X > μ + 3σ as outliers. 
  4. Delete the outliers from the data and repeat from step 1. 
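A minimal sketch of this iterative z-score elimination, assuming one-dimensional numerical data and a cut-off of 3 standard deviations:

```python
import numpy as np

def remove_outliers_zscore(x, threshold=3.0, max_iter=10):
    """Iteratively drop points more than `threshold` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        mu, sigma = x.mean(), x.std()
        mask = np.abs(x - mu) <= threshold * sigma
        if mask.all():      # no outliers left, stop iterating
            break
        x = x[mask]         # delete the outliers and repeat with the cleaned data
    return x

# Example: a few extreme values bias the mean and standard deviation
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 1, 200), [50, 60]])
print(len(data), "->", len(remove_outliers_zscore(data)))
```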

While doing this iteration, it is important to watch how much of the data ends up flagged as outliers. The chart below shows what percentage of the data falls within each number of standard deviations under the standard normal distribution.

Interquartile Range (IQR)

Another way to find outliers is to use the interquartile range. The data is sorted and divided into quarters; the values at the 0.25, 0.50 and 0.75 points are the quartiles. What matters here are the first quartile (Q1) and the third quartile (Q3). Subtracting Q1 from Q3 gives the interquartile range: IQR = Q3 − Q1.

To detect outliers, we can take Q3 plus 1.5 times the IQR as the upper limit, and Q1 minus 1.5 times the IQR as the lower limit. To use the box plot below as an example, Q1 and Q3 are approximately −20 and 20, so the IQR is 40 and 1.5 times the IQR is 60. Therefore, since −20 − 60 = −80 and 20 + 60 = 80, our outlier limits are −80 and 80.
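A small sketch of the IQR rule, assuming one-dimensional numerical data and the conventional factor of 1.5:

```python
import numpy as np

def iqr_outlier_bounds(x, k=1.5):
    """Return the outlier limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 10, 1000), [120, -150]])
lower, upper = iqr_outlier_bounds(data)
print(f"limits: ({lower:.1f}, {upper:.1f})")
print("outliers:", data[(data < lower) | (data > upper)].round(1))
```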

Feature Selection 

Feature selection is the process of picking out the most consistent, unique, and relevant features to be used in model construction. As the size and diversity of datasets continue to increase, it is important to reduce their dimensionality. The main purposes of feature selection are to improve the performance of a predictive model and to reduce the computational cost of modeling.

The presence of redundant features in the dataset can reduce the quality of learning, and removing them saves memory and computation time. For clustering in particular, removing irrelevant features reduces storage and computation time, and it does not hurt clustering accuracy; on the contrary, it tends to improve it.

Feature selection can be performed before clustering by using correlations between features, dimensionality reduction, different mappings of the features, or statistical techniques. In our next article, on feature selection, we will cover the important points that deserve attention.

Evaluation Metrics for Clustering 

It is necessary to evaluate the quality of the product groups that emerge from product segmentation, known as clustering in machine learning and segmentation in business analytics. Since product groups are determined in an unsupervised way (without labels), it is not always possible to judge their number and distribution exactly, so an evaluation metric is needed. Such a metric examines how homogeneous the resulting clusters are and how well they are separated.

For example, one of these metrics is the Silhouette score. This score ranges from −1 to 1, and the higher the score, the better defined and more distinct your clusters are. In essence, it compares how close each point is to its own cluster with how far it is from the other clusters. The Silhouette score formula is shown below. In the formula, a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance from point i to the points of the nearest cluster it does not belong to.

s(i) = (b(i) − a(i)) / max(a(i), b(i))

  Image 16: Silhouette score formula 

Another evaluation metric is the Calinski-Harabasz index. Also known as the variance ratio criterion, it is the ratio of the between-cluster dispersion to the within-cluster dispersion over all clusters; the higher the score, the better the performance. The score is higher when the clusters are dense and well separated, and it is fast to compute.

Another evaluation metric is the Davies-Bouldin score. This score indicates the average ‘similarity’ between clusters, where similarity is a measure that compares the distance between clusters with the size of the clusters themselves. A lower Davies-Bouldin score indicates a model with better separation between clusters. The Davies-Bouldin index is also simpler to compute than the Silhouette score.
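All three metrics are available in sklearn.metrics; the sketch below scores a k-means clustering of a synthetic data set, purely as an illustration of how they are called.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print("Silhouette:       ", silhouette_score(X, labels))         # higher is better, in [-1, 1]
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))  # higher is better
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))     # lower is better
```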

Identifying Niche Segments 

Finding niche segments within the general population is an important issue in segmentation problems. Data points that are similar in nature end up in the same clusters, but alongside the strong similarities, smaller and less dense groups of observations can be overlooked. In that case, niche segments that call for more specific action may be missed. Trendify has developed a new algorithm that also takes these niche groups into account.

The Trendify Segmentation Algorithm aims to segment clusters better. It iteratively tests established machine learning models on the given dataset and keeps the best result. At each stage, the segmentation is made with whichever model scores highest. If a product group is not sufficiently differentiated, the model runs again, and segmentation continues until the groups reach an optimal level. Clustering evaluation metrics are used to measure the success of the model. Thanks to the models selected at each stage, the results are more efficient and effective than those of classical clustering algorithms, and the output also reveals niche segments.

The picture below shows a sample data set clustered with a classical clustering algorithm and with the Trendify Segmentation algorithm. The cluster plots were drawn after the multidimensional data was reduced to two dimensions with PCA. In the first graph on the left, the data was divided into 3 clusters. The data clustered with the Trendify Segmentation algorithm, on the right, was divided into 6 clusters. Cluster 1 in the left graph was split into 3 further clusters because it was very dense and held a disproportionately large share of the data. Likewise, the second cluster was split into 2 separate groups because it was dense and large. In this way, more niche segments were reached and the data became better separated and easier to observe.

The Trendify Product Segmentation algorithm tries to reach the best segmentation result by iteratively trying different clustering models and hyperparameter optimizations. At each stage it compares the candidate models by their success scores and chooses the best one for you. After model selection, it evaluates the resulting clusters against the re-segmentation criteria and re-segments those that need it. This process continues until the optimal clusters are found. Thus, thanks to the Trendify Product Segmentation algorithm, you get the most efficient and most interpretable niche segments.

Thanks for reading

Date : 20.01.2021

Author : Mustafa Gencer (Data Scientist, TRENDIFY)
