TRENDIFY SEGMENTATION SOLUTIONS
This is the fourth article in our series. It highlights the importance of segmentation for business.
IV. What are the key points in Product / SKU segmentation for business?
- What is the working principle of Product Segmentation?
- Evaluation of segmentation results for business
IV. What are the key points in Product / SKU segmentation for business?
In this article, we will talk about the types and working principles of the clustering (segmentation) algorithms used by Trendify. In general, there are four types of clustering algorithms:
Centroid-Based Algorithms
The purpose of centroid-based clustering is to find the centroids of clusters according to their elements. The K-means algorithm is one of the most well-known centroid-based clustering algorithms. Here k is the number of clusters and is a hyperparameter of the algorithm. The basic idea behind the algorithm is to partition the points into k groups by their distance to a center and to find k centroids such that the sum of squared distances between each point and its cluster's centroid is minimized.
Another important point is choosing the initial centroids. The K-means++ algorithm is preferred for this: it intelligently selects the initial cluster centers for the k-means clustering algorithm in order to accelerate convergence.
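As a minimal sketch of the idea (using scikit-learn and synthetic data, not Trendify's actual pipeline), K-means with k-means++ initialization might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: 300 points around three synthetic centers
# (the feature values here are invented for the sketch).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 2))
    for c in [(0, 0), (5, 5), (0, 5)]
])

# init="k-means++" picks well-spread initial centroids, as described above;
# n_clusters (k) is the hyperparameter that must be chosen in advance.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels.shape)                   # one cluster label per point
print(kmeans.cluster_centers_.shape)  # k centroids
```

Note that the quality of the result depends on k; in practice several values of k are tried and compared.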
Density-Based Clustering
Density-based clustering links areas of high sample density into clusters. This allows arbitrarily shaped distributions, as long as the dense areas can be connected. The basic logic of the algorithm is to group together all points that lie within a certain distance of each other in sufficiently dense regions. Unlike K-means, this type of algorithm does not require the number of clusters to be specified in advance.
Among density-based clustering methods, the most well-known algorithms are DBSCAN and OPTICS. Both work with an epsilon and a minimum-samples parameter. Epsilon is the maximum distance between two points for them to be considered neighbors, and the minimum-samples parameter is the smallest number of points required to form a dense region. In the picture below, cluster boundaries are drawn according to the chosen epsilon, and core, border, and noise points are marked. Core points have at least the minimum number of neighbors within epsilon. Border points fall within epsilon of a core point without being dense themselves; they are assigned to a cluster based on distance. Noise points belong to no cluster and are treated as outliers.
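The roles of the two parameters can be sketched with scikit-learn's DBSCAN on invented data (the blob positions and parameter values below are illustrative only):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points (synthetic data).
X = np.vstack([
    rng.normal((0, 0), 0.3, size=(50, 2)),
    rng.normal((4, 4), 0.3, size=(50, 2)),
    rng.uniform(-2, 6, size=(5, 2)),   # likely noise
])

# eps is the neighborhood radius; min_samples is the minimum number of
# points required for a neighborhood to count as dense.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Points labeled -1 are noise; note that no cluster count was given.
print(sorted(set(db.labels_)))
```

Unlike K-means, the number of clusters here is an output of the algorithm, not an input.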
Distribution-Based Clustering
This clustering approach assumes that the data is generated from distributions such as Gaussian distributions. The Gaussian distribution (also known as the normal distribution) is a bell-shaped curve, symmetric about its mean, with an equal share of values expected above and below it.
In Image 4, the distribution-based algorithm clusters the data into three Gaussian distributions. The greater a point's distance from the center of a distribution, the less likely it is to belong to that distribution; the bands in the picture show this decreasing probability. The best-known example of a distribution-based algorithm is the Gaussian Mixture model. Like K-means, it takes the number of clusters as a parameter.
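A minimal Gaussian Mixture sketch (scikit-learn, synthetic data invented for illustration) shows the probabilistic membership this approach produces:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic data drawn from three Gaussians, matching the assumption
# behind distribution-based clustering.
X = np.vstack([
    rng.normal(m, 0.6, size=(80, 2)) for m in [(0, 0), (6, 0), (3, 5)]
])

# n_components plays the same role as k in K-means.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# predict_proba gives each point's membership probability for each
# Gaussian; it decreases with distance from the distribution's center.
probs = gmm.predict_proba(X)
print(probs.shape)  # (240, 3)
```

The soft probabilities are what distinguish this family from the hard assignments of K-means.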
Hierarchical Clustering
Hierarchical clustering creates clusters in a hierarchical, tree-like structure (also called a dendrogram). There are two basic approaches: agglomerative and divisive. In the agglomerative (bottom-up) approach, all objects are initially separate: each data point starts as its own cluster, and clusters with similar attributes are then merged step by step until a single cluster remains. In the divisive (top-down) approach, a splitting strategy dominates instead: there is only one cluster at the beginning, and at each stage objects are separated from the main cluster according to the distance/similarity matrix, forming different subsets, until each data point is its own cluster. An example of hierarchical clustering is given in Image 5.
One of the most important parameters of the hierarchical clustering algorithm is the linkage parameter. Common linkage types are single, complete, and average linkage. Single linkage merges the two clusters whose closest members are nearest to each other in the distance matrix. Complete linkage merges clusters by considering the greatest distance between their members. Average linkage merges clusters based on the average of the pairwise distances between the data in the two structures.
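The effect of the linkage choice can be sketched with scikit-learn's agglomerative implementation (the data below is synthetic and only illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Two synthetic groups of 40 points each.
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in [(0, 0), (5, 5)]])

# The linkage parameter controls how inter-cluster distance is measured:
# "single" uses the closest pair, "complete" the farthest pair,
# "average" the mean pairwise distance.
for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, np.bincount(labels))
```

On well-separated data all three linkages agree; on elongated or noisy data they can produce quite different trees.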
Evaluation of segmentation results for business
It should be interpreted for which variables the created segments stand out, and these variables should be compatible with real life. Model results cannot be integrated directly into processes by looking only at the score: they should be integrated only if they are consistent and meaningful for the business. Making sense for the business is possible by making the results understandable and interpretable.
Segmentation is an unsupervised learning technique, used when, by their nature, the labels of the data are not known. It aims to form groups by looking at the common aspects of the data. Therefore, unlike in supervised learning, it is not possible to check how far a result deviates from the actual outcome: quality can be measured with criteria such as accuracy or F1 score in classification (or MAE in regression), but not in segmentation.
For example, let's consider the problem of identifying the profiles of people who come to watch a basketball game. The questions here are what kinds of people come to watch the match (age, gender, interests, etc.), how large each similar group is within the total audience, and what the prominent features of the groups are. That is, the aim is to reveal an unknown situation. Therefore, visualizing and explaining the results in an interpretable way becomes the most important issue.
The following methods can be preferred for interpreting and understanding segmentation results:
- Dimensionality reduction
- Graphic visualization
- Box Plot
Graphic visualization is one of the most frequently used techniques. It helps you understand the structure of the data and see how well the resulting clusters separate. Many types of graphs can be used for data visualization: scatter, histogram, bar, box, and pie plots, among others. In Image 6, you can see an example made with the PCA dimensionality reduction method. PCA reduces multidimensional data to a small number of dimensions, allowing us to represent the data more conveniently while also reducing its size. In the image, a dataset originally consisting of 9 features was reduced to 2 dimensions with PCA, and the resulting clusters are shown with a scatter plot. It can be observed that the clusters are well separated from each other.
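The reduction step behind a plot like Image 6 can be sketched as follows (scikit-learn PCA on synthetic 9-feature data; the dataset is invented, only the shapes mirror the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic stand-in for a 9-feature dataset like the one in Image 6.
X = rng.normal(size=(200, 9))

# Project the 9 features down to 2 principal components so that the
# clusters can be drawn on a 2-D scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (200, 2)
print(pca.explained_variance_ratio_)  # variance captured by each axis
```

The explained-variance ratio is worth checking: if the two components capture little of the total variance, the 2-D scatter may misrepresent how well the clusters actually separate.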
Other common visualization techniques are the histogram and the pie chart. The histogram makes it easy to see the distribution of numerical data, while the pie chart does the same for categorical data. In the image below, the distribution of each variable can be seen clearly, both within the segment and overall: blue histograms show the intra-segment distribution, while red histograms show the overall distribution. Since the purpose variable is categorical, the pie chart near the center shows the intra-segment distribution, while the pie chart far from the center shows the overall distribution. Thus, it is easy to interpret in which segment each variable differs most from the general population.
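The underlying comparison (intra-segment distribution versus overall distribution of a categorical variable) can be sketched with pandas; the column names and values below are hypothetical, invented for illustration:

```python
import pandas as pd

# Hypothetical segmented data: "segment" is the cluster label,
# "purpose" a categorical variable like the one in the image.
df = pd.DataFrame({
    "segment": ["A"] * 4 + ["B"] * 4,
    "purpose": ["retail", "retail", "online", "online",
                "online", "online", "online", "retail"],
})

# Overall distribution of the categorical variable...
overall = df["purpose"].value_counts(normalize=True)
# ...versus the distribution inside one segment.
segment_a = df[df["segment"] == "A"]["purpose"].value_counts(normalize=True)

print(overall.to_dict())
print(segment_a.to_dict())
```

Plotting these two series side by side (histogram or pie) is exactly the inner-versus-outer comparison the image shows.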
Each cluster is characterized by the distribution of its features. For a feature to express or explain a cluster, its distribution within that cluster must be well behaved, that is, statistically uniform or nearly so. Two of the most common measures are the standard deviation and the mean. The standard deviation is the square root of the arithmetic mean of the squared differences between the data and the mean; it shows how spread out a feature is. Values that diverge too far from the cluster average cannot express the cluster well, so features with a high standard deviation within a cluster should not be used to interpret that segment. The second measure is the mean, obtained by dividing the sum of the data by the number of data points. The mean shows how important a feature is for a cluster and how it differs from the other clusters; a notably low mean is just as informative as a high one.
In Image 8, box plots were created for a dataset with 4 clusters and 3 features. In the graphs, each cluster is evaluated per feature, with box plots showing the interquartile range, the median, and the spread. According to this graph, features that do not have a high standard deviation and do have a high mean should be considered in the cluster interpretation; the other features should not be used to evaluate the cluster.
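The per-cluster statistics behind such a chart can be computed with a simple pandas groupby; the table below is hypothetical (random values, 4 clusters, 3 features) and only mirrors the shape of the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Hypothetical result table: 200 points, 3 features, 4 cluster labels.
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
df["cluster"] = rng.integers(0, 4, size=200)

# Per-cluster mean and standard deviation: features with a low std and a
# distinctive mean are the ones worth using to interpret a cluster.
stats = df.groupby("cluster").agg(["mean", "std"])
print(stats)
```

Reading this table row by row is the numerical counterpart of reading the box plots in Image 8.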
In this article, we talked about the interpretation and evaluation of segmentation results. While evaluating the results, we used dimensionality reduction, visualization techniques, and box plot analysis. These techniques make the segments more explainable and interpretable. Remember that evaluating segmentation results well is just as important as obtaining them: each segment we evaluate correctly adds business value.
Thanks for reading!
Date : 13.01.2021
Author : Mustafa Gencer (Data Scientist , TRENDIFY)