Unsupervised Machine Learning for Big Data via Apache Mahout

Updated: Sep 1, 2020


Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set without pre-existing labels and with minimal human supervision. We have only the inputs; the algorithm generates the outputs by performing a cluster analysis, which subsequently enables the application of supervised machine learning.


Our goal is to create an experimental laboratory for analyzing a Big Data volume of news texts, identifying and grouping them by similarity in HDFS.


We will use our Hadoop ecosystem, with Apache Mahout properly installed, as the test environment in a Linux virtual machine, applying the entire process via the Linux command line.

Below are the steps of the process:

# Creating a predictive model for unsupervised learning
# Create a folder on HDFS
hdfs dfs -mkdir /mahout/clustering
hdfs dfs -mkdir /mahout/clustering/data
# Copy datasets to HDFS
hdfs dfs -copyFromLocal news/* /mahout/clustering/data
hdfs dfs -cat /mahout/clustering/data/*
# Convert dataset to sequence object
mahout seqdirectory -i /mahout/clustering/data -o /mahout/clustering/kmeansseq
# Convert the sequence to TF-IDF vectors
mahout seq2sparse -i /mahout/clustering/kmeansseq -o /mahout/clustering/kmeanssparse
hdfs dfs -ls /mahout/clustering/kmeanssparse
# Building the K-means model
# -i directory with input files
# -c destination directory for centroids
# -o output directory
# -k number of clusters
# -ow overwrite
# -x number of iterations
# -dm distance measurement
mahout kmeans -i /mahout/clustering/kmeanssparse/tfidf-vectors/ -c /mahout/clustering/kmeanscentroids -cl -o /mahout/clustering/kmeansclusters -k 3 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
# View files on HDFS
hdfs dfs -ls /mahout/clustering/kmeansclusters
# Dump the clusters to a text file
mahout clusterdump -i /mahout/clustering/kmeansclusters/clusters-1-final -o clusterdump.txt -p /mahout/clustering/kmeansclusters/clusteredPoints/ -d /mahout/clustering/kmeanssparse/dictionary.file-0 -dt sequencefile -n 20 -b 100 
# View the clusters.
cat clusterdump.txt

Copying the datasets to HDFS.

Then we apply the pre-processing step, formatting the data so that it can be delivered to the algorithm: since we are working with text analysis, we first create a sequence file of the documents and then convert it into a sparse matrix of TF-IDF vectors, as we can see below.

# Convert the sequence to TF-IDF vectors
mahout seq2sparse -i /mahout/clustering/kmeansseq -o /mahout/clustering/kmeanssparse

hdfs dfs -ls /mahout/clustering/kmeanssparse
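Conceptually, seq2sparse tokenizes each document and weights each term by TF-IDF (term frequency times inverse document frequency). A minimal Python sketch of that weighting, using toy documents rather than the actual corpus, might look like this:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF: docs is a list of token lists; returns one sparse dict per doc."""
    n = len(docs)
    # document frequency: in how many docs each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # weight = raw term frequency * log(N / document frequency)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Illustrative documents, not the real news files
docs = [["venus", "williams", "tennis"],
        ["tennis", "open", "final"],
        ["stock", "market", "rally"]]
vecs = tfidf_vectors(docs)
# "stock" appears in only one document, so it carries a higher weight than
# "tennis", which appears in two
```

Terms shared across documents are down-weighted, which is why distinctive words dominate each cluster's top-term list later on.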

Building the K-means model

mahout kmeans -i /mahout/clustering/kmeanssparse/tfidf-vectors/ -c /mahout/clustering/kmeanscentroids -cl -o /mahout/clustering/kmeansclusters -k 3 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure

It is worth mentioning that, since we only have 7 files and this is a laboratory exercise, we will run the simulation with k = 3.
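The algorithm itself can be sketched in plain Python. This is a simplified, single-machine Lloyd-style k-means using cosine distance (the measure selected by -dm above), not Mahout's distributed implementation; the 2-D points are purely illustrative:

```python
import math
import random

def cosine_distance(a, b):
    """1 - cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def kmeans(points, k, iterations=10, seed=42):
    """Assign points to nearest centroid, recompute centroids, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: cosine_distance(p, centroids[i]))
            clusters[idx].append(p)
        # new centroid = per-dimension mean; keep old one if a cluster is empty
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two obvious directions in 2-D: points near the x-axis and near the y-axis
points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
clusters = kmeans(points, k=2)
```

Cosine distance compares the direction of vectors rather than their magnitude, which suits TF-IDF text vectors, where document length should not dominate similarity.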

View files on HDFS

hdfs dfs -ls /mahout/clustering/kmeansclusters

We cannot read the files in HDFS directly because they are binary; the reading of the binary files is performed by an Apache Mahout tool called clusterdump, which converts them to txt format, according to the command and the figure below.

Dump the clusters to a text file

mahout clusterdump -i /mahout/clustering/kmeansclusters/clusters-1-final -o clusterdump.txt -p /mahout/clustering/kmeansclusters/clusteredPoints/ -d /mahout/clustering/kmeanssparse/dictionary.file-0 -dt sequencefile -n 20 -b 100
For this process we apply -n 20 to list only the top 20 words, rather than all words.
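Selecting the top n terms is just a sort over each centroid's term weights. A small sketch, using hypothetical weights loosely modeled on the dump output:

```python
def top_terms(centroid, n=20):
    """Return the n highest-weighted terms of a centroid, as clusterdump -n does."""
    return sorted(centroid, key=centroid.get, reverse=True)[:n]

# Hypothetical centroid weights, not the actual cluster output
centroid = {"venus": 1.847, "williams": 1.847, "top": 3.186, "has": 1.56}
print(top_terms(centroid, n=2))
```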

Below we have the visualization of the txt result in our terminal.

Above we see Apache Mahout's presentation of the output, which is not very friendly. The VL entries represent the 3 clusters.

The lines below show part of the 7 files as they were divided into the 3 clusters. All of these identifiers are explained in the official documentation; without it, the results cannot be clearly understood.

1.0: [distance = 0.04297090059729214]: [{"domination": 1.847}, {"has": 1.56}, {"it's": 1.847}, {"tennis": 1.847}, {"top": 3.186}, {"venus": 1.847}, {"williams": 1.847}]
1.0: [distance = 0.2118848347173945]: [{"about": 1.847}, {"has": 1.56}, {"it's": 1.847}, {"tennis": 1.847}, {"venus": 1.847}, {"williams": 1.847}, {"won": 1.847}, {"world": 1.847}]


With these results, we can label the clusters so that supervised machine learning can then be applied. The algorithm segmented the clusters by similarity: starting from the mathematical distance between word vectors, it assumes that nearby documents share some type of similarity. However, the algorithm cannot label the clusters itself, so someone must identify what type of news each cluster refers to, which in this case would be 3 different types of news.
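Once a person has inspected each cluster's top terms and chosen a name for it, turning the clustering output into a labeled training set is a simple mapping. The cluster ids, label names, and file names below are illustrative, not from the actual run:

```python
# Hypothetical mapping produced by a human inspecting each cluster's top terms
cluster_labels = {0: "sports", 1: "politics", 2: "economy"}

# (document, cluster id) pairs as they might be parsed from clusterdump.txt
assignments = [("news1.txt", 0), ("news2.txt", 0), ("news3.txt", 2)]

# Build the labeled data set that a supervised learner could consume
labeled = [(doc, cluster_labels[cid]) for doc, cid in assignments]
print(labeled)
```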