Big Data with Apache Spark

Updated: Sep 1, 2020


Apache Spark is an open source framework for distributed computing. It was developed at the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, which has maintained it ever since. Spark provides an interface for programming clusters with parallelism and fault tolerance.

The main difference between Spark and Hadoop is latency: Hadoop MapReduce is a high-latency computing framework with no interactive mode, whereas Spark is a low-latency framework that can process data interactively. With Hadoop MapReduce, a developer can process data only in batch mode, whereas Spark can also process near-real-time data through Spark Streaming.

Spark is one of the most widely used frameworks for Big Data. It generally performs better because processing is carried out in memory, while Hadoop MapReduce writes intermediate results to the hard drive. For that reason, Hadoop is more often recommended for data sets above 1 TB, which may not fit in memory, and when the best performance is not a critical requirement.


In this article, we take a text file and run an application that counts the frequency of each word in standalone mode, and then we describe how to create a cluster that is accessed remotely.

Spark in Standalone Environment

We will run the application in a standalone environment only to test the automated PySpark application, since in production Spark reaches its full utility in a clustered environment with many machines.

First, we write the code for the application in Python, as shown below.

import sys

from pyspark import SparkContext, SparkConf
# Two classes are imported: SparkConf holds the configuration (the application name and the master it runs on),
# and SparkContext creates the driver's connection, which in this case is the connection to Spark.

# The application will be executed with spark-submit, not from the pyspark shell or Jupyter,
# so the script must define everything it needs in the main block below; otherwise it cannot run.
# This is common for a standalone (self-contained) application.

if __name__ == "__main__":

    # Create the SparkContext explicitly, because the pyspark shell (which would create one for us) is not used here
    conf = SparkConf().setAppName("Word count").setMaster("local")
    sc = SparkContext(conf=conf)
    # The next step is to load the text file. Since we are working in standalone mode and not
    # with HDFS, we pass the path of the file on our own computer.
    # Note: words is an RDD object. textFile creates an RDD, and Spark handles it as an RDD.
    # We can apply actions and transformations to RDDs in Spark.
    # An RDD is immutable, so each transformation generates a new RDD.
    words = sc.textFile("/home/hadoop/input.txt").flatMap(lambda line: line.split(" "))
    # The words in the text are separated by spaces, so the first transformation is applied as the file is loaded:
    # each line is split on spaces to build a list of words.
    # "flatMap" is a transformation function.
    # "lambda" defines an anonymous function.
    # "line" represents each line of the file.
    # This code returns a list of words.

    # With the word list in hand, the next step is the MapReduce process
    # that counts how many times each word appears.
    # Each object has methods and attributes.
    counting = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

    # Save the result.
    # The path is an output directory, not a file name; Spark writes its files inside it.
    # The directory below is an assumed example path and must not already exist.
    counting.saveAsTextFile("/home/hadoop/output")

After saving the file above with gedit on Linux, we run it with spark-submit as shown below.

Apache Spark generates four files, as shown in the figure below, which is why we specified a directory path rather than a file name when creating the application.
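As a rough illustration of why the output is a directory, the sketch below reads a word-count output directory back into a Python dictionary. It assumes the usual layout written by Spark on a local file system (one or more `part-*` data files plus a `_SUCCESS` marker and hidden `.crc` checksum files) and that each line is the text form of a `(word, count)` tuple; the helper name `read_word_counts` is hypothetical.

```python
import ast
from pathlib import Path


def read_word_counts(output_dir):
    """Merge every part-* file in a Spark text output directory into one dict."""
    counts = {}
    # Only the part-* files hold data; _SUCCESS and the hidden .crc files are skipped.
    for part in sorted(Path(output_dir).glob("part-*")):
        for line in part.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # Each line is assumed to look like ('spark', 3)
            word, count = ast.literal_eval(line)
            counts[word] = counts.get(word, 0) + count
    return counts
```

Calling `read_word_counts` with the output directory of the application would return one dictionary for the whole result, regardless of how many part files Spark wrote.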

Then we open the output with gedit to view the data, as shown below.

This is one of the first steps in natural language processing. Many natural language processing techniques, which today are among the most advanced in artificial intelligence, use this mapping and reduction process to count the frequency of each word; from that point, the process of building the context for each word can begin. Mapping and reduction are also the basis of many other analytical processes.
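To make the mapping and reduction steps concrete, here is a minimal pure-Python sketch of the same logic, without Spark: split lines into words, map each word to a `(word, 1)` pair, and reduce the pairs by key with addition. The function name `word_count` and the sample lines are illustrative assumptions; on a cluster, Spark performs the same steps in parallel across partitions.

```python
from collections import defaultdict


def word_count(lines):
    """Count word frequencies with explicit flat-map, map, and reduce-by-key steps."""
    # Flat-map: every line yields several words, flattened into one list.
    words = [word for line in lines for word in line.split(" ")]
    # Map: pair each word with the count 1.
    pairs = [(word, 1) for word in words]
    # Reduce by key: add up the counts of identical words.
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    return dict(totals)


# Illustrative input, not taken from the article's text file.
sample = ["big data with spark", "spark and hadoop", "big clusters"]
print(word_count(sample))
```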

Spark in a Cluster Environment

We will now repeat this operation in a lab that simulates a real scenario: remote access to a cluster in a client-server arrangement. To simulate it, we access our virtual machine from our physical machine. In a real production scenario, the virtual machine would be our server, which could be in a cloud environment, on a customer's network, or anywhere else.

Another important point is updating Spark, since the client and the server must run the same Spark version for a remote connection to succeed.

Spark update procedure below

First we have to check the current version as shown in the figure below.

The great challenge of working with open source tools is the frequent demand for updates, so it is important to check the official documentation site regularly; from there it is possible to upgrade or downgrade so that the client and server versions are compatible.

It is worth mentioning that preview versions are not recommended for production; they are intended for non-production development environments.

Another important factor is that open source projects such as Spark keep a history of versions in release archives, where you can find and download a specific version.

Then we download the release of the version shown below.

First we have to check if the local and server versions are the same.

To create the Spark cluster, we need to make the configurations below. Practically all products in the Hadoop ecosystem are configured through configuration files, and many of those files ship as templates to prevent them from being used improperly.

Therefore, we create a copy of each template so that the original stays untouched, as shown below.

These are the two files that need to be configured to create the Spark cluster.

The next step is to edit the Spark environment file.

As we can see in the figure above, the entire file is commented out, which means that no setting in this file is in use, so Spark falls back to its own default values. We will make two changes to this file.

The first change is the setting below, the IP address that Spark binds to on this node.

# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node

The second change, below, refers to the master's IP, which in our laboratory is our virtual machine.

# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname

Below we have the two IP addresses, one for our physical machine and the other for our virtual machine, which will act as the server.

We check whether our physical machine can communicate with our virtual machine, using the command below at the command prompt of the physical machine.

The next step is to configure the Spark host and Spark master IP as shown below.

For this setup, these two parameters are required; the others are at the user's discretion.

The next step is to edit the slaves file by entering the IP. In this specific case it is the same IP, because the master and the worker are on the same machine: we are simulating a production cluster environment in a laboratory with only one machine.

In a production environment, we have one machine for the master and numerous other machines for the workers.

It is worth mentioning that these configurations are the responsibility of the data engineer and not the data scientist.

We removed localhost and inserted the IP address of our machine so that Spark allows remote access; otherwise it would assume that access could only come from localhost.
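The configuration steps above could also be scripted. The sketch below copies the two templates and writes the settings described in the text; it assumes a Spark 2.x/3.0 layout in which the templates are named `spark-env.sh.template` and `slaves.template` inside the `conf` directory. The function name `configure_standalone` is hypothetical, and the address `192.0.2.10` used in the usage note is a placeholder, not a value from this laboratory.

```python
import shutil
from pathlib import Path


def configure_standalone(conf_dir, node_ip):
    """Copy the Spark templates and set the IP used by the master and the worker."""
    conf_dir = Path(conf_dir)
    env_file = conf_dir / "spark-env.sh"
    slaves_file = conf_dir / "slaves"

    # Work on copies so the shipped templates stay untouched.
    shutil.copy(conf_dir / "spark-env.sh.template", env_file)
    shutil.copy(conf_dir / "slaves.template", slaves_file)

    # Append the two settings; every other line of the template stays commented out.
    with env_file.open("a", encoding="utf-8") as handle:
        handle.write(f"\nexport SPARK_LOCAL_IP={node_ip}\n")
        handle.write(f"export SPARK_MASTER_HOST={node_ip}\n")

    # Keep the template's comment lines, drop "localhost", and list the worker IP.
    kept = [
        line
        for line in slaves_file.read_text(encoding="utf-8").splitlines()
        if line.strip() != "localhost"
    ]
    slaves_file.write_text("\n".join(kept + [node_ip]) + "\n", encoding="utf-8")
    return env_file, slaves_file
```

A call such as `configure_standalone("/path/to/spark/conf", "192.0.2.10")` would apply both changes in one step; with several workers, the slaves file would list one address per line instead.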

The next step is to initialize the cluster. The figure below shows everything you need to start Spark.

Then just start the cluster using the complete initialization script as shown below, without needing to start Hadoop.

Now we can access it remotely through the browser using our IP and Spark's default port 8080, as shown below.
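Before opening the browser, it can help to confirm from the client that the master's ports accept connections. The sketch below is a plain TCP check and not part of Spark; it assumes the standalone defaults (8080 for the web UI and 7077 for the master), and the helper names `port_is_open` and `check_master` are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_master(host, ports=(8080, 7077)):
    """Report reachability of the web UI and master ports on the given host."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling `check_master` with the server's IP returns a dictionary that maps each port to `True` or `False`, which separates a network problem from a Spark configuration problem.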

It is worth remembering that our environment is for testing, which is why the access security protocols were not configured. In a production environment, this process is performed by the data engineer.

This completes the configuration of our server. The Spark worker and master must be the same version, and the files must be unzipped in the folder.

Below is the command for remote access.

spark-shell --master spark://<master-ip>:7077

After running the command above, refresh the page of the Spark Master server to see the new application listed, as shown in the figure below.

Once remote access is established, the MapReduce process follows the same logic as in the standalone example above.
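As a sketch of what "the same logic" looks like against the cluster, the function below runs the word count with the master set to a `spark://` URL instead of `local`. It assumes PySpark is installed on the client with the same version as the cluster and that the input path is readable by the workers; the names `master_url` and `word_count_on_cluster` are hypothetical.

```python
def master_url(host, port=7077):
    """Build a standalone master URL; 7077 is the default master port."""
    return f"spark://{host}:{port}"


def word_count_on_cluster(host, input_path):
    """Run the word count on a standalone cluster and return the counts as a dict."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("Word count").setMaster(master_url(host))
    sc = SparkContext(conf=conf)
    try:
        counts = (
            sc.textFile(input_path)
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
        )
        # collectAsMap brings the result back to the driver; suitable for small outputs only.
        return counts.collectAsMap()
    finally:
        sc.stop()
```

Only the master setting changes compared with the standalone script; the transformations are identical, which is what lets the same application move from one machine to a cluster.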


In summary, we presented the standalone MapReduce process with Apache Spark, one of the main Big Data tools, along with its advantages over Hadoop, its differences from it, and when each should be used. In parallel, we also presented the data engineering process for creating a cluster with a master and workers for remote access.