Real-Time Sentiment Analysis via Spark and Python Using Twitter

Updated: Nov 23, 2020



Introduction


Traditional data processing models work on finite datasets, whereas Spark Streaming processes data in micro-batches: the incoming stream is split into small batches, each covering a time interval that the analyst defines in advance, which gives results close to real time. The streaming processing model is applied in several areas, such as:


• ETL streaming

• Anomaly detection

• Data enrichment, complex sessions and continuous learning

• Fraud detection

• Spam filter

• Network intrusion detection

• Analysis of social media in real time

• Stream analysis of clicks on sites, feeding recommendation systems

• Real-Time Ad Recommendation

• Stock Market Analysis


With Spark Streaming, data can be collected, analyzed, transformed and summarized so that machine learning can be applied for classification and forecasting in real time. This makes it possible to classify anomalies, fraud and similar events as they occur and to act on them immediately. It supports both iterative processing, where several tasks run in sequence, and interactive processing, as in exploratory data analysis.
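The micro-batch idea can be illustrated without Spark itself. The sketch below is a plain-Python illustration, not Spark's implementation: it groups timestamped events into fixed time windows, much as a streaming engine cuts a stream into batches at a configured interval. The function name `micro_batches` and the sample events are hypothetical.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, payload) pairs into fixed-length time windows.

    Returns a list of (window_start, [payloads]) sorted by window start.
    """
    if batch_seconds <= 0:
        raise ValueError("batch_seconds must be positive")
    windows = defaultdict(list)
    for timestamp, payload in events:
        # Floor division maps every timestamp to the start of its window.
        window_start = int(timestamp // batch_seconds) * batch_seconds
        windows[window_start].append(payload)
    return sorted(windows.items())

# Hypothetical events: (seconds since the stream started, message text).
events = [(0.5, "a"), (1.2, "b"), (5.1, "c"), (9.9, "d"), (10.0, "e")]
for start, batch in micro_batches(events, batch_seconds=5):
    print(start, batch)
```

Each printed line corresponds to one batch that would be handed to the analysis step as a unit.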

Sentiment analysis is the use of natural language processing, text analysis, data mining and computational linguistics to identify the elements of a text that express opinion or emotion.

In general terms, sentiment analysis aims to determine someone's attitude towards a topic, or the contextual polarity of a document. In other words, from the way words are written and grouped within a text, we can identify the feeling of the person who wrote it and whether it has a positive, negative or neutral connotation.
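To make the notion of polarity concrete, here is a deliberately simple sketch that scores a text by counting words from two small hand-made lists. This is only a toy illustration of the idea, not the method used in this project; the word lists and the function name `polarity` are assumptions made for the example.

```python
import re

# Tiny hand-made lexicon for illustration only; real systems learn these
# associations from labeled data or use a much larger curated word list.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def polarity(text):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great product"))
print(polarity("What a terrible, awful day"))
print(polarity("The meeting is at noon"))
```

A word-count approach like this ignores context and negation, which is one reason a trained classifier is usually preferred.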

Most companies work with batch data rather than real-time data, so there is a whole timeline before an analysis reaches decision makers. In a traditional production scenario, a company collects data on Monday, cleans it on Wednesday, analyzes it on Thursday and delivers the results on Friday.

More advanced companies collect data during the day, process it at night and deliver the report to decision makers the next day.

Real-time sentiment analysis shortens this whole process, so analyses that are needed immediately can reach decision makers in time to act on them.


Objective


Twitter is one of the most dynamic social networks available today. It makes it possible to collect, in real time, valuable information about people's feelings on the most varied topics. Hundreds of millions of tweets are posted worldwide every day, and a lot of valuable information is hidden in them. The data stream generated by Twitter can feed analytical applications, allowing companies to understand, in real time, what customers, partners and suppliers are thinking (and writing) about them and their brands, products or services.

In this project, we collect Twitter streaming data, apply real-time analysis techniques (as the data is generated) and obtain insights on a given subject.

Our goal is to analyze the sentiment of tweets containing the term Trump in 2020, after the “Black Lives Matter” movement. We collect tweets in real time as they are generated and use natural language processing to check whether they have a positive, negative or neutral connotation.

In this way, we can make some inferences, complementary to public opinion polls, about public opinion of Trump during an election year.


Technology


We use Spark Streaming and Python, together with the Natural Language Toolkit (NLTK), one of the main frameworks for natural language processing, running on Ubuntu Linux, plus an app registered on the Twitter developer platform.

Below is the configuration used in Ubuntu.



The Process


The first step is to create our app on Twitter, as shown below.


Twitter data collection

• Create your Twitter account

• Go to https://apps.twitter.com/app/new

• Create new app

• Save your credentials


Now we have to check what kind of information we will need for our analysis tool.

Next, we check the access keys generated for the app, as shown in the figure below; these must be pasted into our analysis tool.

We do not show the actual keys here, for privacy and security reasons.

Consumer_key: XXXXXXXXX

Consumer_secret: XXXXXXXXX

Access token: XXXXXXXXX

Access token secret: XXXXXXXXX
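One way to keep these four values out of the script itself is to read them from environment variables. The sketch below assumes that approach; the variable names and the helper `load_credentials` are hypothetical, and the placeholder values mirror the masked keys above.

```python
import os

# Hypothetical environment variable names; choose any naming you prefer.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_credentials(environ=None):
    """Read the four Twitter credentials, failing early if any is missing."""
    if environ is None:
        environ = os.environ
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {key: environ[key] for key in REQUIRED_KEYS}

# Demonstration with placeholder values instead of the real environment.
example = {key: "XXXXXXXXX" for key in REQUIRED_KEYS}
credentials = load_credentials(example)
print(sorted(credentials))
```

In real use, calling `load_credentials()` with no argument reads from the process environment, so the keys never appear in the source file.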


Then we need to check that the app has read and write permission, as shown in the figure below.


Below is the original file that we will use as our pre-classified database, as described in our Python script below. Note that it has an unstructured format; in the next steps we organize this data so that it can be properly processed by our learning algorithm.
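As a rough idea of what organizing the data can look like, the sketch below turns raw labeled lines into (text, label) pairs. The file layout is not reproduced in this article, so the format here (a numeric label, a comma, then the quoted tweet text) and the label encoding are assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical raw content: one record per line, a numeric label followed by
# the tweet text. The real file layout is not shown in this article.
RAW = '''0,"I can't stand waiting in this line"
1,"What a wonderful morning!"
1,"Loving the new update"
'''

# Assumed label encoding for this sketch only.
LABELS = {"0": "negative", "1": "positive"}

def load_labeled_tweets(handle):
    """Return a list of (text, label) pairs, skipping malformed rows."""
    records = []
    for row in csv.reader(handle):
        if len(row) != 2 or row[0] not in LABELS:
            continue  # ignore rows that do not match the assumed layout
        records.append((row[1].strip(), LABELS[row[0]]))
    return records

dataset = load_labeled_tweets(io.StringIO(RAW))
for text, label in dataset:
    print(label, "|", text)
```

With a real file, the same function can be given an open file handle instead of the in-memory string.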



Below is our Python script, whose comments document each stage of the NLTK process used for sentiment analysis to achieve our initial objective.
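For readers who want to see the general shape of such a pipeline, here is a minimal pure-Python sketch of its stages: tokenizing, counting words per label, and classifying new text. It assumes a naive Bayes classifier with add-one smoothing, a common choice for this task; it is a stand-in for illustration, not the NLTK-based script itself, and the training sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_texts):
    """Count words per label; returns the pieces needed for classification."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            count = word_counts[label][word] + 1  # unseen words count as 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training examples, far too few for real use.
training = [
    ("I love this, it is great", "positive"),
    ("what a wonderful happy day", "positive"),
    ("I hate this, it is awful", "negative"),
    ("what a terrible sad day", "negative"),
]
model = train(training)
print(classify("a great and happy day", model))
print(classify("an awful and terrible day", model))
```

A real model needs thousands of labeled examples; the four sentences here only show how the counts drive the decision.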








Below is the data being collected in real time, which can be viewed in the terminal where we open the port for the connection. Note that this data is already clean.
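The kind of cleaning that produces such output can be sketched with a few regular expressions. The rules below (dropping URLs, mentions, retweet markers and punctuation) are assumptions about what "clean" means here, not the exact steps of our script, and the sample tweet is invented.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_TEXT_RE = re.compile(r"[^a-z0-9\s']")

def clean_tweet(text):
    """Normalize a raw tweet: drop URLs, mentions, 'RT' markers and symbols."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = re.sub(r"\bRT\b", " ", text)
    text = text.lower()
    text = NON_TEXT_RE.sub(" ", text)  # removes '#' but keeps the hashtag word
    return " ".join(text.split())      # collapse repeated whitespace

raw = "RT @someone: Breaking NEWS!!! Read more at https://example.com #Election2020"
print(clean_tweet(raw))
```

Cleaning before classification keeps tokens such as links and usernames from being treated as sentiment-bearing words.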





Conclusion


After extracting all the data in an organized way, we export the content to create a visualization of the final classification. We can then evaluate the percentages and see whether the overall result of the sentiment analysis was positive or negative.


Percentage of tweets about Trump (positive in green, negative in red) every 5 seconds
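The percentages behind a chart like this can be computed per batch with a simple count. The sketch below assumes each 5-second batch arrives as a list of already classified labels; the function name `sentiment_percentages` and the sample batches are hypothetical and do not reproduce our actual results.

```python
from collections import Counter

def sentiment_percentages(labels):
    """Return the share of each label in one batch as percentages."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}

# Hypothetical classified tweets from two consecutive 5-second batches.
batches = [
    ["negative", "negative", "positive", "negative"],
    ["positive", "negative"],
]
for index, labels in enumerate(batches):
    print(index * 5, sentiment_percentages(labels))
```

Each printed line gives the batch start time in seconds and the share of each label, which is the data a plotting tool would need.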


As our chart shows, the current result of the sentiment analysis towards Trump is negative. In a production environment, this kind of monitoring could support the decision making of candidates in the presidential race that will take place at the end of 2020.

Managed this way, sentiment analysis can serve as a support tool for performance management during the presidential race.

This is just a lab exercise, but the tool has many applications in real-time decision making.
