# Customer Segmentation via Machine Learning

Updated: Jul 28, 2020

Depending on the amount of data you have and the number of variables you want to focus on in your analysis, cluster analysis may be the perfect tool for you. Especially when it comes to improving your user experience, cluster analysis can help you discover the pattern of people you are trying to market to by pulling in more data and creating an unbiased picture of your digital property. Simply put, it’s a way to discover insights from your data based on groups that are formed.

## An Introduction to Cluster Analysis

So how is cluster analysis different from segmenting? Think of segmenting as an umbrella term, with cluster analysis as one of many ways to segment your data. Cluster analysis is a tool that can segment your users based on behavior/tendencies, interests, age range, and more. The algorithm used is based on machine learning principles, but you have the option to define how many groups you want. Depending on the method, you can pick a number of desired clusters, or let the algorithm figure out how many distinct groups you have within your data.

**Big Data Project**

# Data Analysis about

# “Individual household electric power consumption”

**Lifecycle**

The problems definition steps and collecting data have already been performed.__DEFINE PROBLEM AND DATA COLLECT-__- Normalization process and missing values treatment has been applied before. Anyway, as we find only 1,25% missing values, this does not generate much interference.__PREPARE DATA__– For all models, we apply the training procedure (70%) and test (30%)__TRAIN MODEL__- The k value minimum considered is 70%.__EVALUATE__– This process considers future applications and adjustments, considering the entry of new future data.__DEPLOY AND IMPROVE__

__Problem Description __

The objective of this project is to identify patterns of consumption behavior based on historical data on electrical energy consumption by applying the machine learning method in order to support making decision in the formulation of public energy policies.

To solve this case, we will be using the most widely used unsupervised machine learning method called k-mean.

K-mean is a very intuitive solution and aims to group the dataset and apply a clustering technique, returning the similarities identified with the patterns within the attributes of the dataset, being extremely useful, because it allows the automatic search for patterns that are not perceived The value of k . We can define the k value considering the business rule and the personal evaluation based on dataset knowledge. The key point to the success of any predictive modeling project is in the pre-processive and treatment phase of dataset.

__Data Set Information__

This archive contains 2075259 measurements gathered in a house located in Sceaux (7km of Paris, France) between December 2006 and November 2010 (47 months).

__Attribute Information__

1.date: Date in format dd/mm/yyyy 2.time: time in format hh:mm:ss 3.global_active_power: household global minute-averaged active power (in kilowatt) 4.global_reactive_power: household global minute-averaged reactive power (in kilowatt) 5.voltage: minute-averaged voltage (in volt) 6.global_intensity: household global minute-averaged current intensity (in ampere) 7.sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered). 8.sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light. 9.sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

The k-means application pre requisite is that the entire database to be used is in numerical form, for this purpose follows a small sample of this verification.

Otherwise, we must to convert the dataset for a numeric format before start to apply the k-means model.

**Pre-processing Steps**

**About the Missing Values**

The data does have missing values about **1.25%** of the rows (about 82 days). We can see below some dates where we have missing values.

We can notice that the number of missing values is relatively small compared to the size of the database. So, it’s not harmful to the model if we delete these missing values.

```
Python code :
power_consumption = data.iloc[0:, 2:9].dropna()
```

__Training and test the datasets Process__

As we work with datasets, a **machine learning algorithm** works in two stages. We usually split the data around 30%-70% between testing and training stages.

```
Python code :
pc_toarray = power_consumption.values
df_treino, df_teste= train_test_split(pc_toarray, train_size =.01)
```

__Dimensionality reduction__

Applies dimensionality reduction, we are not reducing the numbers of variables

we only collected all variances from the variables and input it in 2 components

and these components represents the same variables information

so now we can work with only these 2 components to make predictions.

```
Python code :
hpc = PCA(n_components = 2).fit_transform(df_train)
```

__Building the model__

Now we apply the fit in the object calls hpc which was transformed before, with the reduced dimensionality.

```
Python code :
k_means = KMeans()
k_means.fit(hpc)
```

As we did not choose the parameters for k, the k_means() chose n_cluster=8 as a default, how we can see bellow.

**Organizing the Data Cluster**

Now we are going to organize the data to get minimum and maximum values and arrange shape to plot it and to build the graphic.

__Python code ____:____ __
x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)

__Area Clusters Plot__

This second process used k_means with k = 8.

Each color area represent a cluster area.

__Python code ____:____ __
plt.figure(1)
plt.clf()
plt.imshow(z,
interpolation = 'nearest',
extent = (xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Paired,
aspect = 'auto',
origin = 'lower')

**Plot of each centroid’s clusters**

The centroid or geometric center of a flat figure below is the arithmetic mean position of all points in the figure. The red X represent the 8 centroids clusters and we have some distant points that could not be identified. It is not a clustering mistake but show us that these points do not have similarities with the other groups, so it’s could be out liers or wrong data inserted to dataset.

So, in this case we should to discuss about it with the business area to be sure why these points are different.

__Python code ____:____ __
plt.plot(hpc[:, 0], hpc[:, 1], 'k.' , markersize = 4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1], marker = 'x' , s = 169, linewidths = 3, color ='r', zorder = 8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

__Determining a range for K__

In this specific case is possible to run this process, but in a big data case it would be hard to process it because it will spend to much time to run different k values.

__Python code ____:____ __
k_range = range(1, 14)

__Appling K-Means model for each K value __

The objective is to see the different results and identify the best k value using a list comprehension, running in one command different operations, creating a model and fitting it for each k value inside of the range.

__Python code ____:____ __
k_means_var = [KMeans(n_clusters = k).fit(hpc) for k in k_range]

__Adjusting the centroids cluster for each model__

The objective in this code using a list comprehension is collect the centroids for each model with different k values.

__Python code ____:____ __
centroids = [X.cluster_centers_ for X in k_means_var]

__Calculating the Euclidean distance__

For each centroids value listed before, we calculate the Euclidean distance (or metric distance ). The Euclidean distance is the distance between two points which can be proved by repeated application of the Pythagorean theorem.

__Python code ____:____ __
k_euclid = [cdist(hpc, cent, 'euclidean') for cent in centroids]
dist = [np.min(ke,axis=1) for ke in k_euclid]

__Sum of the squares of distances inside of cluster__

__Python code ____:____ __
wcss = [sum(d**2) for d in dist]
Total Sum of the squares
tss = sum(pdist(hpc)**2)/hpc.shape[0]

__Total Sum of the Squares - Sum of the Squares of distances inside of clusters__

__Python code ____:____ __
bss = tss - wcs

__Elbow Curv__

The next step is calculate the variance resulted of the Total Sum of the Squares - Sum of the Squares of distances inside of clusters. From this result we know that for each k value we can understand the variance value

__Python code ____:____ __
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(k_range, bss/tss*100, 'b*-')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Numbers of clusters')
plt.ylabel('Variance Percent Explicated')
plt.title('Explicated Variance x K Value')

The *Explicated Variance x* *K Value* graphic show us that for k value >= 4 the variance could be more explicated

__Creating a new clustering model with k = 7 .__

__Python code ____:____ __
k_means = KMeans(n_clusters = 7)
k_means.fit(hpc)

__Get the Minimum and Maximum Value and arrange the Shape.__

__Python code ____:____ __
x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)

__Area Clusters Plot.__

__Python code ____:____ __
plt.figure(1)
plt.clf()
plt.imshow(z,
interpolation = 'nearest',
extent = (xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Paired,
aspect = 'auto',
origin = 'lower')

__Plot of each centroids clusters.__

__Python code ____:____ __
plt.plot(hpc[:, 0], hpc[:, 1], 'k.' , markersize = 4)
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1], marker = 'x' , s = 169, linewidths = 3, color ='r', zorder = 8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

There was not much difference in outcome between the 7 and 8 cluster models.

Therefore, the cluster number is not always the most important in the clustering process,

but rather the interpretation collected from the model result.

Therefore, in-depth knowledge of the data and the business problem becomes important.

In this case, the clustering model application becomes the fastest part process

and the other parts of the process are those that require the most time

and attention to get the best results.

` `__Python code ____:____ __
?silhouette_score

The best value is 1 and the worst value is -1. Values close to 0 indicate overlapping clusters, meaning that they belong to more than one cluster.Negative values usually indicate that a sample was assigned to the wrong cluster because a different cluster is more similar.

### Silhouette_score

__Python code ____:____ __
labels = k_means.labels_
silhouette_score(hpc, labels, metric = 'euclidean')

## What are the benefits of customer segmentation?

**1. IT ALLOWS YOU TO FINE-TUNE YOUR MESSAGE**

When you segment your marketing efforts to specific groups of people, it allows you to hone in on specific messages that you want to advertise.

It enables you to fine-tune your marketing message to align with exactly what the recipient is looking for, and therefore, increases the chances that they’ll convert.

**2. INCREASE YOUR REVENUE**

By fine-tuning your marketing message, you’ll see increases in your revenue because users will be more likely to make a purchase when they’re delivered exactly what they need.

In fact, segmented and targeted emails generate 58% of all revenue for a company, which isn’t hard to believe.

When you segment your emails, you’ll also have a subject line that that is personalized to the recipients’ needs, which can increase open rate by 26%. And it’s obvious that the more emails are opened, the more sales you’ll make.

**3. YOU’LL INCREASE AWARENESS FOR YOUR BRAND**

Users love personalized emails. When they receive something that’s made just for them, they’ll feel more comfortable purchasing from your company because they know you care.

This also builds brand awareness because customers will remember you as the company that sends them emails based on their interests, previous spending habits, and more.