VI. What are the key points of product segmentation for data scientists?

  • Feature Selection for Segmentation Algorithm
  • Monitoring Results for Segmentation Algorithm

Feature Selection for Segmentation Algorithm

Choosing the right features is of great importance in product segmentation. The quality of the model, and how logical the resulting segments are, depends directly on which features are selected. Features that are strongly related to each other tend to mislead the segmentation model, because the same information is effectively counted more than once. Finding and excluding such redundant features helps the model run faster, saves storage, and helps the model create better segments.

Feature selection differs between supervised and unsupervised learning. Supervised learning uses labeled data, so feature selection usually examines the relationship between each input feature and the target, and decides from that relationship whether the feature is necessary. Unsupervised learning has no target, so features are selected by looking at the correlation of the features among themselves. Below, we examine the correlation between features and the feature selection methods used in unsupervised learning.

Correlation between features

To understand the correlation between features, the distribution of each variable in the data must be well understood. Several statistical measures are used to determine whether there is a relationship between these distributions: variance, standard deviation, covariance, and the Pearson correlation coefficient.

Variance is a measure of the variability or spread in a data set. The variance of a variable is calculated as σ² = Σ(X − μ)² / N, where X is the value of each point in the data, μ is the mean, and N is the number of elements in the data. The variance shows how widely the data is spread. The standard deviation is the square root of the variance, and it expresses that spread in the same units as the data. If the variance or standard deviation is 0 or close to 0, the variable hardly varies and contains many values that are the same or close to each other. In that case it makes sense to remove the variable: features with low variability do not contribute to the model, yet they increase its size by occupying unnecessary space and cost extra processing time.
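The low-variance rule above can be sketched in a few lines. This is a minimal illustration rather than the method used by any particular tool: the function name `low_variance_features`, the toy columns, and the cutoff of 0.001 are all assumptions, and because variance depends on the scale of a feature, a real cutoff would have to be chosen per feature or after scaling.

```python
from statistics import pvariance

def low_variance_features(columns, threshold=1e-3):
    """Return the names of features whose population variance is <= threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) <= threshold]

# Hypothetical toy data: "is_textile" is constant, so it carries no information.
columns = {
    "Revenue": [120.0, 80.0, 300.0, 45.0],
    "is_textile": [1.0, 1.0, 1.0, 1.0],
}

print(low_variance_features(columns))
```

Here only the constant column is reported as a candidate for removal.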

Covariance measures the extent to which the deviations of one variable from its mean match the deviations of another variable from its mean. In other words, covariance describes how two variables vary together linearly. The covariance between two variables is calculated as cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / N, where x_i and y_i are the points of the two variables, x̄ and ȳ are their means, and N is the number of points. A positive result indicates a positive linear relationship between the variables, a negative result indicates an inverse relationship, and 0 indicates no linear relationship. In real-life data, a result of exactly 0 is rarely encountered because almost everything is slightly related to everything else.

A high covariance does not mean a strong relationship. Two features can have very different scales, so that a small change in the small-scale feature corresponds to a large change in the large-scale feature even if the relationship is relatively modest. Because covariance is not standardized, its magnitude cannot be compared against a fixed cutoff; the decision to remove a feature should be based on the standardized correlation coefficient instead. If the absolute value of the correlation between two features is 1, one of them can be removed without losing information. If it is far from 1, it is better to keep both in the model than to leave one of the variables out.
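The scale dependence of covariance can be shown with a small sketch. The data and the names `units`, `revenue`, and `revenue_cents` are invented for illustration; the same feature is expressed in two units, which multiplies the covariance by 100 while the correlation stays the same.

```python
from math import sqrt

def covariance(x, y):
    """Population covariance: mean product of the deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def pearson(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data: the same revenue figures in two different units.
units = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [12.0, 19.0, 33.0, 38.0, 52.0]
revenue_cents = [v * 100 for v in revenue]

cov_small = covariance(units, revenue)
cov_large = covariance(units, revenue_cents)

print(cov_small, cov_large)
print(pearson(units, revenue), pearson(units, revenue_cents))
```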

Both correlation and covariance show the relationship and dependence between two variables.

  • Covariance shows only the direction of the linear relationship between the variables.
  • Correlation measures both the magnitude and the direction of the linear relationship between two variables.

In simple terms, correlation is a function of covariance. What sets the two apart is that correlation values are standardized, whereas covariance values are not. The correlation coefficient of two variables is obtained by dividing their covariance by the product of their standard deviations. In the following part of the article, we describe one type of correlation, the Pearson Correlation Coefficient, as an example.

The Pearson Correlation Coefficient is obtained by dividing the covariance of the two variables by the product of their standard deviations: r = cov(x, y) / (σ_x · σ_y). Whenever the relationship between two variables is analyzed, calculating the correlation coefficient tells the analyst how strong that relationship is. In this way, we obtain both the degree and the direction (positive or negative) of the relationship between two variables. A value of -1 indicates a perfect negative linear correlation between the two variables, 0 indicates that there is no linear relationship, and 1 indicates a perfect positive linear correlation.

Image-3: Pearson Correlation Coefficient Formula

If the Pearson correlation coefficient between two features exceeds a chosen threshold in absolute value, excluding one of them makes it possible to train the model better and to reduce the size of the data.
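One possible way to apply such a threshold is a greedy pass over the features, sketched below. The helper `drop_correlated`, the threshold of 0.9, and the toy columns are assumptions made for this example; the sketch also assumes that every feature has non-zero variance, since the coefficient is undefined otherwise.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: Revenue is almost proportional to Cumulative_Sales.
columns = {
    "Cumulative_Sales": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Revenue": [105.0, 198.0, 310.0, 395.0, 502.0],
    "Warehouse_Stock": [7.0, 3.0, 9.0, 2.0, 6.0],
}

print(drop_correlated(columns))
```

The order of the columns decides which feature of a correlated pair survives, so in practice the more interpretable feature would be listed first.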

Combining features that we have with PCA

In addition to the correlation between features, combining the existing features with each other is important for making better use of the data. Dimension reduction is useful for extracting new features and shrinking the data, as well as for finding patterns that would otherwise not be captured.

Principal Component Analysis (PCA) is one of the dimension reduction methods. PCA is a statistical procedure that allows you to summarize the information content in large data tables through a smaller set of “summary indexes” that can be more easily visualized and analysed. Generally, dimension reduction is used to make data more compact and explainable. PCA is used when the number of variables needs to be reduced but it is not clear which variables can be removed completely, and when overfitting of the model needs to be avoided.

Now, let’s give an example through a two-dimensional (feature) graph to more easily understand how PCA works. The chart below shows a simple 2-dimensional data with Feature 1 and Feature 2.

Here, the PCA algorithm minimizes the distance between the data points and their projections on the best fit line (the blue line) with the help of Singular Value Decomposition (SVD). The role of SVD here is to decompose the data into mutually orthogonal vectors, which define the new dimensions. Using these vectors, PCA determines the line for which the sum of the squared distances from all points is minimal. In the graph, the average of the data points of Feature 1 and Feature 2 is around point A. Equivalently, PCA maximizes the distances of the projected points on the best fit line from point A.

To show the maximization process, place the points and the fit line so that point A is at the origin. The distance d1 is then the distance of the projection of the first point from the origin, that is, from point A. Similarly, d2, d3, d4, d5, and d6 are the distances of the projections of the other points from the origin. The best fit line is the line for which the sum of the squares of these distances, d1² + d2² + d3² + d4² + d5² + d6², is largest.

After the best fit line is determined, the next dimension is created by drawing a line perpendicular to it.
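For two features, the best fit line can be found without a linear algebra library. The sketch below does not use SVD as described in the text; it uses the closed-form direction of the leading eigenvector of the 2×2 covariance matrix, which gives the same line for two-dimensional data. The six points and all function names are made up for illustration.

```python
from math import atan2, cos, sin

def first_component_2d(points):
    """Return (mean point A, unit direction of the best fit line) for 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Angle of the leading eigenvector of the covariance matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return (mx, my), (cos(theta), sin(theta))

def projected_sum_of_squares(points, mean, direction):
    """Sum of the squared distances of the projections from the mean point."""
    return sum(
        ((p[0] - mean[0]) * direction[0] + (p[1] - mean[1]) * direction[1]) ** 2
        for p in points
    )

# Hypothetical (Feature 1, Feature 2) pairs.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8), (5.0, 5.3), (6.0, 5.9)]

mean_point, direction = first_component_2d(points)
second = (-direction[1], direction[0])  # perpendicular line, the second dimension

print(mean_point, direction, second)
print(projected_sum_of_squares(points, mean_point, direction))
```

Evaluating `projected_sum_of_squares` for any other direction through the mean point gives a value that is no larger than the one for `direction`, which is the maximization described above.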

To give an example of the use of PCA: if, in a data set of 20 features, the first two principal components explain 60% of the variance and the next 5 components explain 35%, then 7 components in total explain 95% of the variance of the data. The remaining 13 components represent only 5% of the variance. Reducing the data to these 7 components therefore preserves most of the information. In addition, removing one feature from each pair of highly correlated features also contributes to dimension reduction. Removing features saves storage and helps the model work better by getting rid of unnecessary inputs.
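The arithmetic of this example can be sketched as a cumulative sum over explained-variance ratios. The individual ratios below are hypothetical; they are only chosen so that the first 2 sum to 60%, the next 5 to 35%, and the last 13 to 5%, as in the example. The function name and the small tolerance for floating-point rounding are likewise assumptions.

```python
def components_for_variance(explained_ratios, target=0.95, tol=1e-9):
    """Smallest number of leading components whose ratios add up to the target."""
    cumulative = 0.0
    for count, ratio in enumerate(explained_ratios, start=1):
        cumulative += ratio
        if cumulative >= target - tol:  # tolerance absorbs rounding error
            return count
    return len(explained_ratios)

# Hypothetical ratios, sorted from largest to smallest as PCA reports them.
ratios = [0.35, 0.25] + [0.07] * 5 + [0.05 / 13] * 13

print(components_for_variance(ratios))  # 7
```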

Multi-Cluster Feature Selection (MCFS)

Since there is no labeled data in unsupervised learning, feature selection usually focuses on removing features that may be useless. In this section, we examine Multi-Cluster Feature Selection (MCFS), one of the feature selection methods for unsupervised learning, and how it works.

MCFS consists of three steps:

  1. Spectral analysis
  2. Sparse coefficient learning
  3. Feature Selection

In the first step, spectral analysis is applied to the dataset to detect the cluster structure of the data.

Then, in the second step, since the embedded cluster structure of the data is known through the first k eigenvectors of the Laplacian matrix, MCFS measures the importance of the features with a regularized regression model. In the third step, after solving the regression problem, MCFS selects the desired number of features based on the highest absolute values of the coefficients obtained from the regression problem.

Spectral analysis is done using the eigenvalues and eigenvectors of the graph Laplacian. The following pictures show the Laplacian formula and the generation of an example Laplacian matrix. The graph Laplacian is the matrix L = D − A, where A is the adjacency matrix and D is the diagonal degree matrix. To produce the degree matrix, the number of connections of each data point is counted and these counts are written on the diagonal of the degree matrix.

Image 9. Laplacian Function

To find the eigenvalues, the formula |A − λI| = 0 is used. Here A represents an n×n matrix, I represents the identity matrix, and λ represents an eigenvalue. After the eigenvalues are obtained from this equation, the eigenvectors are obtained. To obtain the eigenvectors of the Laplacian matrix, we use the Laplacian matrix L in place of A. The features are then ranked using the eigenvectors obtained in this step.
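The construction L = D − A can be sketched for a small graph as follows. The four-point graph and the function name `graph_laplacian` are invented for this example, and the sketch covers only the construction of the matrix, not the full spectral analysis step of MCFS.

```python
def graph_laplacian(adjacency):
    """Return L = D - A for a symmetric 0/1 adjacency matrix given as nested lists."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]  # number of connections per point
    return [
        [(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]

# Hypothetical graph with 4 data points and edges 0-1, 0-2, 1-2, 2-3.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

L = graph_laplacian(A)
for row in L:
    print(row)
```

Every row of L sums to 0, so the vector of all ones is an eigenvector of L with eigenvalue λ = 0, which is one solution of |L − λI| = 0.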

For details, see the article in which the MCFS method was proposed.

In this article, we have explained the points to consider when selecting features for clustering, which is one of the unsupervised learning techniques. We have covered feature selection based on the correlation between features, dimension reduction, and the MCFS feature selection method. In the following article, we will explain how to interpret the results.

How should clustering results be interpreted?

After the algorithms are selected according to the performance metrics, interpreting their results is especially important for unsupervised models, because interpreting clustering results can be as challenging as building the model.

To interpret clustering results, it is important to evaluate the distribution of the features per cluster and to identify the features that stand out, either above or below the overall level. To determine the distribution of features per cluster, the Trendify clustering algorithm has its own evaluation rules. In this way, the meaning of the clusters is determined automatically in Trendify Clustering results, which allows direct integration into business processes.

Two rules are important in profiling and interpreting the results:

  • Separation from the overall mean

When checking the deviation from the mean for each feature, predetermined threshold values are taken as the basis. A feature is important for a cluster when the cluster's average deviation for that feature is above the threshold value or below the negative threshold value. The threshold thus determines whether the feature represents the cluster or not.
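A rough sketch of this rule is given below. The exact formula used by Trendify is not stated in the text, so the sketch assumes that the separation is the relative deviation of the cluster mean from the overall mean, (cluster mean − overall mean) / overall mean, and that the threshold is 0.5 (50%). The function names and the data are hypothetical, and the overall mean is assumed to be non-zero.

```python
from statistics import mean

def separation_from_overall_mean(values, labels, cluster):
    """Relative deviation of one cluster's mean from the overall mean of a feature."""
    overall = mean(values)
    cluster_mean = mean(v for v, lab in zip(values, labels) if lab == cluster)
    return (cluster_mean - overall) / overall

def is_prominent(separation, threshold=0.5):
    """True when the separation is above +threshold or below -threshold."""
    return separation > threshold or separation < -threshold

# Hypothetical STR values of six products assigned to two clusters.
str_values = [1.0, 1.1, 0.9, 0.2, 0.3, 0.1]
labels = [0, 0, 0, 1, 1, 1]

sep_0 = separation_from_overall_mean(str_values, labels, 0)
sep_1 = separation_from_overall_mean(str_values, labels, 1)

print(sep_0, is_prominent(sep_0))
print(sep_1, is_prominent(sep_1))
```

With these numbers, STR is about 67% above the overall mean in cluster 0 and about 67% below it in cluster 1, so under the assumed threshold it would characterize both clusters.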

Image-11: General Average Decomposition with Threshold

  • Coefficient of Variation Ratio

The Coefficient of Variation ratio of each feature within a cluster must be less than the threshold value determined from the distribution. When this ratio is higher than the threshold, the feature cannot express the selected cluster well. In other words, the more closely the values of the feature are concentrated around the cluster mean, the better the feature expresses the cluster. For this reason, the Coefficient of Variation of the feature distributions should be evaluated as well.
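The coefficient of variation is the standard deviation divided by the mean, so the rule can be sketched as below. The threshold of 0.5, the function names, and the sample values are assumptions for illustration, and the sketch assumes a non-zero cluster mean.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    return pstdev(values) / abs(mean(values))

def represents_cluster(values_in_cluster, threshold=0.5):
    """True when the feature is concentrated enough to describe the cluster."""
    return coefficient_of_variation(values_in_cluster) < threshold

# Hypothetical values of one feature inside two different clusters.
tight = [10.0, 11.0, 9.0, 10.0]   # close to the cluster mean
spread = [1.0, 25.0, 3.0, 11.0]   # widely dispersed around the same mean

print(coefficient_of_variation(tight), represents_cluster(tight))
print(coefficient_of_variation(spread), represents_cluster(spread))
```

Both samples have a mean of 10, but only the tightly concentrated one passes the assumed threshold.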

Image-12: Coefficient of Variation Ratio Formula with Threshold

  • Jaccard Index

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used to measure the similarity and diversity of sample sets. It was developed by Paul Jaccard. The Jaccard coefficient measures the similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets.

Image-13: Jaccard Similarity Index Formula

The Jaccard index can be used to find out how stable a clustering algorithm is. The algorithm is run repeatedly and the clustering results obtained are compared with each other, using the Jaccard index to measure how similar the clusters are from one run to the next. Algorithms that produce similar results in each trial are considered more stable for that data.
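A stability check of this kind might look like the sketch below. Because cluster numbers are arbitrary from run to run, the sketch assumes that each cluster of one run is compared with its best-matching cluster of the other run; this matching strategy, the function names, and the product IDs are assumptions rather than a description of the Trendify implementation.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

def best_match_stability(run_a, run_b):
    """For each cluster in run_a, the highest Jaccard index with any cluster in run_b."""
    return [max(jaccard(c, d) for d in run_b) for c in run_a]

# Hypothetical product IDs grouped by two runs of the same algorithm.
run_1 = [{"p1", "p2", "p3"}, {"p4", "p5"}]
run_2 = [{"p3", "p4", "p5"}, {"p1", "p2"}]

scores = best_match_stability(run_1, run_2)
print(scores)
print(sum(scores) / len(scores))
```

Values close to 1 across repeated runs would indicate that the algorithm produces nearly the same clusters each time.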


  • Interpreting Cluster Profiling Results with Sample Data

In this section, sample data on products in textile retail is analyzed with Trendify Segmentation, and the resulting profiling results are examined. You can find Trendify demo data here.

The dataset was prepared from the sales and stock movements of the products in one season of a brand that has approximately 100 stores in textile retail and also sells online through its own website. Due to data privacy, each variable in the data has been scaled by a certain coefficient.

The data consists of 8 features and 8623 rows. The variables and their explanations are as follows:

Product ID: Unique value assigned to each product

Cumulative_Sales: Cumulative sales values of the product in the season

STR: Total number of sales / Shipment to stores

Revenue: Revenue from product

shelf_life: The shelf life of the product

Recency_Product: The time between the last sale of the product and today

Profit: Profit from the product

Sales_deviation: The standard deviation of the weekly sales of the product

Warehouse_Stock: Warehouse stock of the product at the time

Running the Trendify Algorithm on the sample data produced 11 clusters, and the prominent features of each cluster are given automatically in the cluster profile column. The results are also available in the Trendify analytic demo data. The separation of each variable from the overall average is shown per cluster.

The negatively numbered clusters, Cluster -2 and Cluster -1, contain the outliers. The outliers are first determined in the Trendify data preparation module, and then the most appropriate partitioning is performed. In the outlier clusters, the Cumulative_Sales, STR, and Sales_deviation features are more prominent than in the non-outlier clusters: these 3 features are more than 100% above the average. The profiling also shows that the other 2 features, Recency_Product and shelf_life, are above the threshold. Evaluating the outlier clusters, the cumulative sales and the dispatch rate (STR) are high, but the deviation in sales, that is, the variability, is also high. Because the recency is high, these are likely products whose sales took place in earlier seasons. If any of these products started to be sold in the active season, it should be investigated why they have not sold for a long time.

For example, in Cluster 0, shelf_life and Recency_Product are approximately 50% lower than the overall average, while Sales_deviation and Warehouse_Stock are 150% higher than average. In addition, the STR value is about 70% higher than the average. Since this cluster has a low shelf life, it consists of products that were newly shipped to the stores and have good sales performance. Because the sales deviation is high, these products are suitable to be kept in depth in the stores, and since the central warehouse stocks are also high, the shelf life can be controlled through the depth of the store stocks.

Now let’s examine Cluster 4. All the variables that characterize Cluster 4 in profiling, i.e. Cumulative_Sales, STR, Revenue, shelf_life, Recency_Product, Sales_deviation, and Warehouse_Stock, are about 50% or more below average. This product group has also just entered the store stocks, but unlike the previous group, its sales performance is low. Since the stocks are low, these products should be directed only to the stores with high sales performance, and no further orders should be placed for them for the season.

Finally, let’s examine Cluster 5. Cumulative_Sales and Revenue are about 150% higher than average across the cluster. shelf_life and Recency_Product are around 61% higher than the overall average, and Sales_deviation is 87% higher than average. In other words, the cumulative sales and revenue of this cluster are above average, and it is a product group with a high shelf life and a high standard deviation, although not as high as in the outlier clusters. The high recency value means that the reason these products have not sold for a long time should be investigated. If the central warehouse is out of stock, an in-season order can be placed, or a transfer within the cluster can be run so that products are shipped from stores where they do not sell to stores where they do.

In this article, we have explained the important points to consider when profiling the clusters produced by clustering algorithms. We also examined clustering results obtained with the Trendify Segmentation Algorithm on sample data, showing the prominent features of the resulting clusters and the inferences drawn from them. Product groups created correctly by the Trendify Segmentation Algorithm reveal the actions that need to be taken in the business. With Trendify Segmentation Models, you can have the power to control and advance your business with the right product groups.

The Author: Mustafa Gencer

Publishing Date: January 27, 2022
