Supervised Machine Learning for Big Data via Apache Mahout

Updated: Sep 1, 2020


Introduction


Apache Mahout is a tool dedicated to Data Science disciplines such as Machine Learning. It provides the main clustering, statistical modeling, and regression algorithms on top of the open-source Hadoop ecosystem, with data stored in HDFS and processing distributed across a cluster of machines. This horizontal scalability makes it possible to work efficiently with Big Data on the scale of petabytes.

The choice of algorithm depends on the business problem to be solved, but we must also consider the volume of data: when the volume reaches Big Data scale, we must favor parallel rather than sequential algorithms. This is where Apache Mahout matters, because it supports parallel implementations of algorithms such as Random Forest, Naive Bayes, and k-means.

The most widely used ML frameworks, such as R and Python, are not practical when handling data beyond roughly 500 GB on a single machine, so Apache Mahout has its role when it comes to ML applications in Big Data.

Objective


Our main objective is to use Machine Learning to classify the messages of an email database as spam or not spam. It is worth mentioning that the purpose of this article is not machine learning itself, but the infrastructure for machine learning applications on Big Data with Apache Hadoop, through the Apache Mahout tool.

Since Machine Learning works from historical data, we will select messages already labeled as spam and not spam (ham) and use them to train our model so that it can make predictions.
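As a reference, the sketch below shows how the local input data might be organized before loading it into HDFS. The folder and file names are only assumptions for illustration; any layout with one folder of ham (legitimate) messages and one folder of spam messages, each containing plain-text files, will do.


# Hypothetical local layout (names are illustrative only)
# ham/   -> messages already labeled as legitimate (ham)
# spam/  -> messages already labeled as spam
ls ham | head -5
ls spam | head -5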

For more details about the Machine Learning process itself, we have other articles on our blog dedicated to the subject, so we will not go into depth about it here.


Technology


We will use our test environment: a Hadoop ecosystem with Apache Mahout properly installed on a Linux virtual machine, with the entire process carried out via the Linux command line.


Process Steps


Below we follow the procedure used in our test environment to apply machine learning with Apache Mahout, as described in the command listing below (kept in a Sublime Text file).


# Predictive Model Creation with Naive Bayes

# Create folders on HDFS
hdfs dfs -mkdir /mahout
hdfs dfs -mkdir /mahout/input
hdfs dfs -mkdir /mahout/input/ham
hdfs dfs -mkdir /mahout/input/spam

# Copying data from the local filesystem to HDFS
hdfs dfs -copyFromLocal ham/* /mahout/input/ham
hdfs dfs -copyFromLocal spam/* /mahout/input/spam
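# (Optional, illustrative check) confirm the files landed on HDFS;
# the exact listing depends on your dataset
hdfs dfs -ls /mahout/input/ham | head
hdfs dfs -ls /mahout/input/spam | head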

# Convert the data to SequenceFile format (required when working with Mahout)
mahout seqdirectory -i /mahout/input -o /mahout/output/seqoutput
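# (Optional, illustrative check) confirm the SequenceFile output was created
hdfs dfs -ls /mahout/output/seqoutput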

# Convert the sequence to TF-IDF vectors
mahout seq2sparse -i /mahout/output/seqoutput -o /mahout/output/sparseoutput

# View the output
hdfs dfs -ls /mahout/output/sparseoutput

# Split the data into training and test sets
#	-i	                    folder with input data
#	--trainingOutput	    training data
#	--testOutput		    test data
#	--randomSelectionPct	percentage of data held out for testing
#	--overwrite			    overwrite existing output
#	--sequenceFiles		    input is in SequenceFile format
#	--xm				    execution method (sequential or mapreduce)
mahout split -i /mahout/output/sparseoutput/tfidf-vectors --trainingOutput /mahout/nbTrain --testOutput /mahout/nbTest --randomSelectionPct 30 --overwrite --sequenceFiles -xm sequential

# Construction of the Predictive Model
#	-i	training data
#	-li	where to store the label index
#	-o	where to store the model
#	-ow	overwrite existing output
#	-c	train the complementary Naive Bayes variant
mahout trainnb -i /mahout/nbTrain -li /mahout/nbLabels -o /mahout/nbmodel -ow -c

# Model Testing
#	-i  folder with the test data
#	-m	model folder
#	-l	label index
#	-ow	overwrite existing output
#	-o	folder with the predictions
#	-c	test the complementary Naive Bayes model
mahout testnb -i /mahout/nbTest -m /mahout/nbmodel -l /mahout/nbLabels -ow -o /mahout/nbpredictions -c

The process above collects the data and makes it available in HDFS for the MapReduce jobs. Running the command below, we obtain the output shown in the figure below.


hdfs dfs -ls /mahout/output/sparseoutput

Mahout then builds a word list (dictionary) from the original data in HDFS and stores it in the sparse matrix, as we can see in the word count shown in the figure below.
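If we want to inspect these intermediate results from the command line instead of relying on the figures, Mahout's seqdumper utility can print the contents of the SequenceFiles. The sketch below is only illustrative: dictionary.file-0 is the name a typical seq2sparse run produces, so confirm it first with hdfs dfs -ls.


# Illustrative: dump the first entries of the dictionary built by seq2sparse
# (the dictionary.file-0 name is assumed; confirm it with "hdfs dfs -ls" first)
mahout seqdumper -i /mahout/output/sparseoutput/dictionary.file-0 | head -20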


After pre-processing, the next step in the Machine Learning pipeline is to split the data into training (70% of the data) and test (30% of the data) sets, applying the command below from our initial listing.


mahout split -i /mahout/output/sparseoutput/tfidf-vectors --trainingOutput /mahout/nbTrain --testOutput /mahout/nbTest --randomSelectionPct 30 --overwrite --sequenceFiles -xm sequential
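As a quick, purely illustrative sanity check, we can list the two output folders created by the split; the 70/30 proportion itself is controlled by the --randomSelectionPct 30 option above.


# Illustrative check that the training and test folders were created
hdfs dfs -ls /mahout/nbTrain
hdfs dfs -ls /mahout/nbTest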

We then run another MapReduce job, applying the Naive Bayes algorithm to perform the training, as shown in the figure below.

It is worth remembering that both the input data and the output are stored in HDFS.



At this point, we have the machine learning model trained and saved to HDFS, as shown above.
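A simple, illustrative way to confirm this from the command line is to list the model output and the label index that trainnb wrote:


# Illustrative check that the trained model and label index exist on HDFS
hdfs dfs -ls /mahout/nbmodel
hdfs dfs -ls /mahout/nbLabels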

Finally, we run one more MapReduce job, applying our last command to test the model against the test set we separated in the previous split. In other words, we present the machine learning algorithm with data it has not yet seen and ask it to make predictions, so we can assess its final accuracy by evaluating the correct and incorrect predictions.


mahout testnb -i /mahout/nbTest -m /mahout/nbmodel -l /mahout/nbLabels -ow -o /mahout/nbpredictions -c
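Besides the summary that testnb prints to the console (the accuracy and confusion matrix discussed in the conclusion), the predictions themselves are written to HDFS; as an illustrative check, we can simply list the output folder:


# Illustrative check of the predictions folder written by testnb
hdfs dfs -ls /mahout/nbpredictions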

Conclusion


The model reached an accuracy of 100%, which is not common; results with 100% accuracy usually indicate some kind of overfitting. In this case, however, the Naive Bayes algorithm is very powerful and the dataset we used, since this is a test environment, is small compared to a Big Data production environment.

There were also no incorrect classifications in our confusion matrix, which compares the observed values with the values predicted by the model in order to verify its error and success rates.

Another important metric is the model's confidence, which was around 67%, a consequence of the small sample we used.
