Updated: Jun 17, 2020
There are numerous forms of data extraction in Big Data and several specific tools for each type of extraction, however, in this laboratory we will present the extraction of data from a relational database of an Oracle database for distributed parallel storage. in HDFS, using the tools of specific technologies for this process. This article is not a tutorial but only a generic approach to the process, with the presentation of the main steps of this process that guarantees the extraction, transformation and loading of Big Data.
Create a test environment to collect millions of movie rating records and make it possible to use that data to build a movie recommendation system for your customers.
The data is stored in a relational database and the company has a Hadoop cluster for distributed storage and processing. Your job is to take the data from the source to HDFS for further analysis.
Our goal is to collect millions of movie rating records that are stored in an oracle relational database and the company has a Hadoop cluster for distributed storage and processing. Your job is to take the data from the source to HDFS for further analysis.
For this project we will set up our test environment by installing an oracle database on our Linux system and use the Apache Sqoop tool from our Apache Hadoop ecosystem to perform ETL tasks. All tools are open source.
Preparing the Oracle Database
Importing data into Oracle
Importing Oracle data to HDFS via Sqoop
Selecting HDFS files
Preparing the Oracle Database
First, we need to enable the installation of the Oracle database to create the environment of the laboratory so that it is possible to carry out our ETL laboratory, as shown in the figure below.
The next step is to load 20 million records into the Oracle database, creating the Scheme and then we will use Apache Sqoop as an ETL tool and we will take some of that data to our repository with HDFS.
The first step is to create the schema in the database and grant privileges according to the script below,
As we have a test environment here, we will grant “unlimited” space in the database, but in a real scenario this space must be limited and managed to avoid errors.
Then we access the database through the user created according to the script below.
The next step now is to create the tables in the relational database using the DDL command and download the file to load the data with more than 20 million records. This file was collected in a zip file from the grouplens website, as shown in the figures below.
After downloading, we unzipped the file on our Linux operating system, and within the unzipped file, we will select the desired file “ratings.csv” that contains the schema of our created table and its respective variables as per the dictionary below.
ID: identification of each record
USER_ID: ID of the user who evaluated the movie
MOVIE_ID: ID of the movie that was rated
RATING: score from 1 to 5 that the user gave the film
TIMESTAMP: complete number of the exact evaluation date
The next step now is to load the data, creating a directory called ETL and within that directory we will create a file with the load command, as shown in the figure below.
Then we have the successful load of 20 million records as shown below.
Then we double-checked the load file to verify that all data was read correctly, as shown below.
According to the load file above, we found that a line had an error, but it refers to the header that has no numbers, so the file was read successfully.
Then we read the file and verify that it has more than 20 million records, as shown below, thus validating our test environment in comparison to the real world of Big Data.
The next step now is to start the process of moving data from a sql relational database to a hadoop environment. For this purpose, we use the Sqoop tool, which means sql for hadoop, that is, it is one of the tools of the hadoop ecosystem to work with a two-way sql database, allowing to import and export data in HDFS.
Sqoop creates jobs to carry out the mapreduce process, avoiding all the work of developing a job to move data between relational databases and HDFS.
The main advantages of Apache Sqoop are:
• Allows the import of external databases and enterprise data warehouses;
• Promotes data transfer to improve performance and optimize the use of the system;
• Copy data quickly from external sources to Hadoop HDFS;
• Makes data analysis more efficient;
• Avoids excessive loads for external systems;
Another advantage is that we can use Apache Sqoop together with Apaxhe Hive, as shown below.
ETL Hdoop = Sqoop + Hive
The next step is to configure apache Sqoop access to the Oracle database through the jdbc connection and since Sqoop does not have the drive that performs this operation, we will have to download the jdbc driver from the oracle website. If we were using another database, we would have to perform this same process, going to the supplier's website to download the standard jdbc file for each database.
This process allows the Oracle database to be able to see HDFS and Sqoop.
In the oracle user, you need to set the environment variables for Hadoop and Sqoop in the ˜ / .bashrc file
# Java JDK export JAVA_HOME=/opt/jdk export PATH=$PATH:$JAVA_HOME/bin # Hadoop export HADOOP_HOME=/opt/hadoop export HADOOP_INSTALL=$HADOOP_HOME export HADOOP_COMMON_HOME=$HADOOP_HOME export HADOOP_MAPRED_HOME=$HADOOP_HOME export HADOOP_HDFS_HOME=$HADOOP_HOME export YARN_HOME=$HADOOP_HOME export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin # Sqoop export SQOOP_HOME=/opt/sqoop export PATH=$PATH:$SQOOP_HOME/bin export HCAT_HOME=/opt/sqoop/hcatalog export ACCUMULO_HOME=/opt/sqoop/accumulo
Oracle despite seeing HDFS, it does not have write permission, since only the hadoop user has this permission. Therefore, we need to configure this permission from the Oracle database to HDFS.
To perform this permission, since this article is performed in a test environment, we will use the HDFS owner to grant the privilege in a simplified way directly from the directory, giving permission for any user to record on HDFS. This is the way to take data from Oracle to HDFS.
Sqoop's access to HDFS is through the use of user privileges. In a production environment, the authentication protocol configuration would take place through the kerberos, which would be the responsibility of the data engineer and not the data scientist.
So, accessing how to use hadoop, we define the privileges through the commands below, the 777 referring to full access, as we are only in a testing environment and not in production:
This command changes the privilege on HDFS:
hdfs dfs -chmod -R 777 /
This command changes the privilege in the operating system:
chmod -R 777 / opt / hadoop / logs
So now Oracle has full permission on HDFS, as shown in the figure below and we have the environment ready to carry out our laboratory.
Then we started the final step of transferring data from Oracle to HDFS, using the sqoop import code below.
sqoop import --connect jdbc:oracle:thin:firstname.lastname@example.org:1539/orcl --username aluno -password dsahadoop --query "select user_id, movie_id from cinema where rating = 1 and \$CONDITIONS" --target-dir /user/oracle/output -m 1
Below is the successful import report.
Anyway, in the figure below we check the output directory.
Below is the job reduce performed by sqoop, which demonstrates that this tool is simplifying the process, as it avoids the development of the writing of the entire programming described in the figure below.
Below we visualize the file that we imported in HDFS.
Therefore, we demonstrate the feasibility of importing Big Data files from an Oracle database to HDFS through the Sqoop tool in a test lab environment that simulates a real production environment with clusters, thus adding all the advantages of working in clusters when it comes to Big Data.