Big Data with Hadoop and Python: processing structured data with billions or trillions of lines

Updated: Sep 1, 2020


This Big Data scenario article uses a sample database in Hadoop to demonstrate the collection, storage, and processing of billions or even trillions of lines with reliability and efficiency, generating the outputs that are the objective of the article, which would not be viable with traditional technologies.


We will analyse a Big Data set of film ratings, a survey with a series of scores from 1 to 5. The program will count how many ratings were given for each score, where 1 is a negative rating and 5 a positive one.


The purpose of the set of tools that make up Hadoop is to allow the processing and storage of large amounts of data in a distributed way, that is, using low-cost, fault-tolerant computer clusters.

This processing is divided across several nodes in a cluster to maximize the computational power. Put simply, a cluster is a set of hardware that works synchronously so as to function as a single computer: several machines acting in an organized way as if they were one.

This clustering is necessary because a single server would not be able to process that much data. In this way, it is possible to offer storage, processing, access, security, operation and governance.

We will use the Linux operating system on virtual machines and the Python language for the programming that will be carried out on the Hadoop cluster, as we can see in the figure below.

Initially we will import the Big Data files into HDFS and initialize HDFS and YARN for process management, as shown below.
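On a standard pseudo-distributed Hadoop install, starting HDFS and YARN looks like the following (assuming `$HADOOP_HOME/sbin` is on the PATH; this is a sketch of the step, not the article's exact commands):

```shell
# Start HDFS (NameNode and DataNodes) and YARN (ResourceManager and NodeManagers).
start-dfs.sh
start-yarn.sh

# List the running Java daemons to confirm the services came up.
jps
```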

Creating directory for MapReduce, unzipping the Big Data file.
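These two steps could look like the commands below; the directory and archive names (`mapreduce`, `ml-100k.zip`) are illustrative assumptions, not taken from the article:

```shell
# Create a working directory for the MapReduce job and extract the data set into it.
mkdir mapreduce
cd mapreduce
unzip ml-100k.zip
```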

Then we preview the data using the Sublime Text editor.

In the Sublime Text preview above, the first column is the user ID, the second is the film ID, the third is the score the user gave the film, and the fourth is the exact time when the rating was made.

Now let's copy this file from the operating system into HDFS using the command below.
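A typical way to do this uses the HDFS file system shell; the file name `u.data` and the HDFS paths here are example assumptions:

```shell
# Create a target directory in HDFS and copy the ratings file into it.
hdfs dfs -mkdir -p /user/hadoop/mapreduce
hdfs dfs -copyFromLocal u.data /user/hadoop/mapreduce/

# Confirm the file is now in HDFS.
hdfs dfs -ls /user/hadoop/mapreduce
```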

Then we use the Python language to implement MapReduce, creating a class with two methods: one called “mapper” to map the data and a second called “reducer” to reduce the mapped data, as we can see below.
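The mapper/reducer pair described above can be sketched in plain Python. This local sketch only illustrates the counting logic; on the cluster the same pair would run inside a Hadoop job (for example via the mrjob library or Hadoop Streaming). The class and method names are illustrative:

```python
from collections import defaultdict

class RatingsBreakdown:
    def mapper(self, line):
        # Each input line: user_id <TAB> movie_id <TAB> rating <TAB> timestamp.
        # Emit the rating as the key with a count of 1.
        user_id, movie_id, rating, timestamp = line.split("\t")
        yield rating, 1

    def reducer(self, rating, counts):
        # Sum all the 1s emitted for this rating.
        yield rating, sum(counts)

    def run(self, lines):
        # Simulate Hadoop's shuffle phase: group mapper output by key,
        # then hand each group to the reducer.
        grouped = defaultdict(list)
        for line in lines:
            for key, value in self.mapper(line):
                grouped[key].append(value)
        for key in sorted(grouped):
            yield from self.reducer(key, grouped[key])
```

For example, running `dict(RatingsBreakdown().run(lines))` over the ratings file yields a count per score.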

Then we execute the Python code from the terminal against the file already stored in HDFS, and the processing is carried out across the Hadoop cluster.
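If the job was written with the mrjob library, submitting it to the cluster could look like this; the script name and HDFS path are assumptions, and `-r hadoop` selects the Hadoop runner instead of local execution:

```shell
# Submit the MapReduce job to the Hadoop cluster against the file in HDFS.
python RatingsBreakdown.py -r hadoop hdfs:///user/hadoop/mapreduce/u.data
```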

After processing, we have the result below:

"1" 6110

"2" 11370

"3" 27145

"4" 34174

"5" 21201


Therefore, the Big Data toolset enables the collection, storage and processing of large amounts of data on the petabyte scale, generating outputs according to the business problem, which in this article was to obtain descriptive statistics of film scores from 1 to 5.

The final result showed that score 4 received the most ratings, with 34,174 occurrences.
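The counts produced by the job can be double-checked with a few lines of Python:

```python
# Rating counts produced by the MapReduce job above.
counts = {"1": 6110, "2": 11370, "3": 27145, "4": 34174, "5": 21201}

# Find the most frequent score and the total number of ratings.
top_score = max(counts, key=counts.get)
total = sum(counts.values())
print(top_score, counts[top_score])  # score "4" with 34174 ratings
print(total)                         # 100000 ratings in all
```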

This article was just a simple demonstration of the potential of Hadoop as a tool for Big Data; several other insights could be extracted according to the business problem.