Big Data using Data Mining with Hadoop and Python for Unstructured Data

Updated: Jun 17, 2020


This Big Data scenario design article uses a database for simulating hadoop that enables the collection, storage and processing of billions over trillions of lines with reliability and efficiency, generating the outputs that are the objective of the article, which would not be viable with traditional technologies.


We will map all the words in a book and create a list of key and value to identify which words appear most often in the text that can be used in an artificial intelligence system so that it is possible to learn what are the characteristics of the text, natural language , analysis of feelings, etc.


The reason for this set of tools that make up Hadoop is to allow the processing and storage of large amounts of data in a distributed way, that is, using low cost and fault tolerant computer clusters.

This processing is divided into several nodes or clusters, to maximize the computational power. For simplicity, a cluster is the set of hardware that works synchronously to function as if it were a single computer. Thus, several machines act in an organized way as if they were one.

This clustering is necessary because a single server would not be able to process that much data. In this way, it is possible to offer storage, processing, access, security, operation and governance.

We will use the linux operating system on virtual machines and the python tool for programming that will be carried out in hadoop clusters as we can see in the figure below.

Initially we store the dataset in HDFS and then we use the python programming language to create the mapping and reduction logic according to the script in the figure below.

Hadoop script

Visualization of the original unstructured data from the dataset

Python script with programming logic.

The result of our process came with data that needs to be cleaned, as we can see in the figure below, the same word appears with “,”, “/”, etc. However and the same word. Therefore, we will now need to correct and clear this data.

Therefore, the next step is to perform a filter to perform data cleaning because for the purpose of this article, these characters are not important.

To perform this cleaning, we will use python with the following logic of the script below that returns only the words of regular expressions, ignoring punctuation, numbers, special characters, etc.

Then we run the mapreduce job again with the appropriate filters and obtain the clean data as we observed in the figure below.

As we noted, there is a large amount of data, so, to facilitate observation, we will present the data in a more summarized form, presenting the words that had the most frequency and the words that had the least frequency, using the logic of the Python programming language. with jobs aligned with Mrstep to perform several mapping and reduction steps in the same job, considered in As noted in the figure below.

As we can see in the figure below, we now have a more organized presentation of the data.


Therefore, the application of Big Data using data mining for unstructured data, allows the storage of a large mass of data in HDFS and its processing via MapReduce and even the data mining process for cleaning the data during the process, in order to allow a better presentation of the result regarding the objective of the business problem, which was to identify the words according to their frequency of occurrence.