Social Network Analytics

Updated: Sep 1, 2020




Introduction


Social media usage continues to grow, which has led companies to make major marketing investments in these channels.

Consumers use these media to find information, services, and products, and this produces an ocean of data that can yield valuable insights. Only about 20% of this social media data is relevant, which is why data science is needed to identify the valuable part; this is what we call social data mining.

This technique can identify potential customers, deepen knowledge about customer behavior, and provide a strategic marketing differentiator in support of decision making.

The general idea of this article is to show how data science modeling of social media can be carried out, and for this purpose we will use Twitter. Other platforms such as Facebook and Instagram follow the same modeling methodology, with some differences in authentication, so this Twitter example illustrates how it is possible to work with social media in general.


Objective


Our objective in this article is to explain, at a basic level, how this text mining technology works using Twitter: how text data can be collected, treated, and visualized for analysis, how the company's Twitter account can be updated automatically through a Twitterbot, and how Twitter text data can be used offline. After that, we will apply machine learning and build a recommendation system.



Lifecycle


  • Obtain authentication in the social media

  • Data visualization and exploratory analysis

  • Cleaning and pre-processing

  • Data modeling

  • Result display

  • Applying Machine Learning

  • Applying Recommendation System


Technology


Analysis of unstructured, noisy Big Data using the R programming language.


Obtain authentication in the social media


The first step in the process is to create an API application that can access your Twitter account to perform data collection.

OAuth is an open standard for secure authorization in web, mobile, and desktop applications: an authorization protocol for web APIs that allows client applications to access a protected resource on behalf of a user. We will use this API to access data from the social network, carrying out various types of searches (except for users' private content) through an access token that expires over time and can be renewed.

To this end, we created a developer account and enabled this authentication in R according to the script below:

# Loading the twitteR package
library(twitteR)

# Defining access keys
api_key <- "XXX"
api_secret <- "XXX"
access_token <- "XXX"
access_token_secret <- "XXX"

# Authenticating on Twitter
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

These access tokens are confidential to the user and therefore will not be shown in this article for privacy reasons.

The authentication.r script connects to Twitter.

In R we also have a "geocode" filter, which is widely used and lets us work with the geographic position from which tweets are being posted, filtering by city, neighborhood, and so on. This allows us to develop marketing actions based on the latitude and longitude of the users, enhancing the company's marketing responses to these potential consumers and broadening its marketing horizons much more effectively.

For this purpose, it is enough to have a professional capturing these tweets in real time and interacting with consumers in response.

The searchTwitter function allows us to search by words and by users, restricted to public posts, respecting the privacy policies.
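As a sketch of how this geographic filter can be combined with a keyword search, the geocode argument of searchTwitter() takes a string in the form "latitude,longitude,radius"; the coordinates and radius below are illustrative only:

# Illustrative example: tweets about "covid" posted within 10 miles of a given point
covid_geo_tweets <- searchTwitter("covid",
                                  n = 100,
                                  lang = "en",
                                  geocode = "-23.5505,-46.6333,10mi")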

The next steps will be data collection on Twitter.

> # Collecting tweets
> # we can collect data being the filter by words, language and even by users,
> # respecting privacy policies
> ?searchTwitter
> pandemia_tweets = searchTwitter("pandemia", n = 100, lang = "en")
> covid_tweets = searchTwitter("covid", n = 100, lang = "en")
> #dsa_Tweets = userTimeline(getUser('dsacademybr'), n = 100)
> 
> # Print
> head(pandemia_tweets)
[[1]]
[1] "sollunna: You are so ignorant will is doing a lot of thing during these pandemia where is harry https://t.co/HutupixXLD https://t.co/2Sfa42D8hA"

[[2]]
[1] "BruBernardes: RT @rntdlsss: sim, a pandemia"

[[3]]
[1] "_Serpens_: RT @rntdlsss: sim, a pandemia"

[[4]]
[1] "blackfyre_: RT @rntdlsss: sim, a pandemia"

[[5]]
[1] "milebinha: RT @rntdlsss: sim, a pandemia"

[[6]]
[1] "Saarianebastos: RT @rntdlsss: sim, a pandemia"

> class(pandemia_tweets)
[1] "list"
> head(covid_tweets)
[[1]]
[1] "vintaquin: RT @Wardamn5: COVID didn’t have the desired effect... time to roll out the MK Ultra’s &amp; FF’s.. \n\n\U0001f440\n“Kansas soldier saves 'countless lives'…"

[[2]]
[1] "JohnGri51420377: RT @vicksiern: Why do you think the CDC and the Democrat politicians don't want you taking HYDROXYCHLOROQUINE if you get Covid-19?"

[[3]]
[1] "eboygenji: i don’t think i have covid but i’ve been feeling terrible for the last few days \U0001f614"

[[4]]
[1] "ChaneciaA: RT @DjWalt_: Covid-19 almost made me forget about the real virus in America."

[[5]]
[1] "DragonHawk1959: RT @NorskLadyWolf: The Dems blasted Trump’s CDC crony on the pitiful report that was mandated by Congress on the effects of COVID-19 on min…"

[[6]]
[1] "slobzilla: Bloomberg has the audacity to quote Cuomo comparing this to multiple sold out Yankee Stadiums when his ORDER to sen… https://t.co/O3ZLJxz2Lt"

> class(covid_tweets)
[1] "list"

Data cleaning process


During collection, along with the useful data, we also import irrelevant content that needs to be removed. For this purpose, we created a specific cleaning function that defines what should be excluded from the original text. Examples of what is removed:

http/https links, retweet markers, #hashtags, usernames ("@people"), punctuation, numbers, and unnecessary spaces, as well as converting the character encoding and converting the text to lower case.

It is worth mentioning that items can be added to or removed from this list according to the business problem.

As a good practice, we always create a new object for each new treatment, preserving the initial objects. This lets us follow the evolution of the process and maintain the integrity of the original data, as is common in object-oriented programming.
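The limpaTweets() function applied in the next block is not part of any package; it is a helper written for this cleaning step. Below is a minimal sketch of what such a function might look like, using only base R (the exact rules and their order can be adjusted to the business problem; the encoding conversion may return NA for tweets that cannot be converted, which is why some entries are lost later):

# Sketch of a cleaning helper for a character vector of tweets (illustrative only)
limpaTweets <- function(tweets) {
  tweets <- gsub("RT @\\w+: ?", "", tweets)    # retweet markers
  tweets <- gsub("@\\w+", "", tweets)          # usernames "@people"
  tweets <- gsub("#\\w+", "", tweets)          # hashtags
  tweets <- gsub("http\\S+", "", tweets)       # http/https links
  tweets <- gsub("[[:punct:]]+", " ", tweets)  # punctuation
  tweets <- gsub("[[:digit:]]+", "", tweets)   # numbers
  tweets <- iconv(tweets, to = "ASCII")        # character encoding (may return NA)
  tweets <- tolower(tweets)                    # lower case
  tweets <- gsub("\\s+", " ", tweets)          # unnecessary spaces
  trimws(tweets)
}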

> # Converting tweets to text
> # We use the sapply function to loop through the entire list of tweets
> # and getText to extract the text of each one
> textos_pandemia = sapply(pandemia_tweets, function(x) x$getText())
> head(pandemia_tweets)
[[1]]
[1] "sollunna: You are so ignorant will is doing a lot of thing during these pandemia where is harry https://t.co/HutupixXLD https://t.co/2Sfa42D8hA"

[[2]]
[1] "BruBernardes: RT @rntdlsss: sim, a pandemia"

[[3]]
[1] "_Serpens_: RT @rntdlsss: sim, a pandemia"

[[4]]
[1] "blackfyre_: RT @rntdlsss: sim, a pandemia"

[[5]]
[1] "milebinha: RT @rntdlsss: sim, a pandemia"

[[6]]
[1] "Saarianebastos: RT @rntdlsss: sim, a pandemia"

> textos_covid = sapply(covid_tweets, function(x) x$getText())
> textos_covid[1:10]
 [1] "RT @Wardamn5: COVID didn’t have the desired effect... time to roll out the MK Ultra’s &amp; FF’s.. \n\n\U0001f440\n“Kansas soldier saves 'countless lives'…"  
 [2] "RT @vicksiern: Why do you think the CDC and the Democrat politicians don't want you taking HYDROXYCHLOROQUINE if you get Covid-19?"                  
 [3] "i don’t think i have covid but i’ve been feeling terrible for the last few days \U0001f614"                                                                   
 [4] "RT @DjWalt_: Covid-19 almost made me forget about the real virus in America."                                                                        
 [5] "RT @NorskLadyWolf: The Dems blasted Trump’s CDC crony on the pitiful report that was mandated by Congress on the effects of COVID-19 on min…"        
 [6] "Bloomberg has the audacity to quote Cuomo comparing this to multiple sold out Yankee Stadiums when his ORDER to sen… https://t.co/O3ZLJxz2Lt"        
 [7] "RT @marklutchman: I am more afraid of a Democrat in the White House,\n\nThan I am of COVID-19.\n\nDoes anyone else feel the same? \U0001f914"                 
 [8] "RT @BettyBowers: The USA's share of the world's population:\n\n4.24%\n\nThe USA's share of the world's COVID-19 deaths:\n\n29%\n\nThis is what inco…"
 [9] "\"Our job is to help our clients figure out where they fit in all of this” - @ddroga siempre tirando la precisa.\n\nhttps://t.co/CNyJcYfRj8"         
[10] "RT @censusproject: 2020 Census is crucial to rebuilding from COVID-19 https://t.co/dVb7xIMjHp"                                                       
> class(textos_covid)
[1] "character"
> 
> # Cleaning the tweets
> textos_pandemia_limpo = textos_pandemia
> textos_pandemia_limpo <- limpaTweets(textos_pandemia_limpo)
> head(textos_pandemia_limpo)
[1] "you are so ignorant will is doing a lot of thing during these pandemia where is harry"
[2] "sim a pandemia"                                                                       
[3] "sim a pandemia"                                                                       
[4] "sim a pandemia"                                                                       
[5] "sim a pandemia"                                                                       
[6] "sim a pandemia"                                                                       
> names(textos_pandemia_limpo) = NULL
> textos_pandemia_limpo = textos_pandemia_limpo[textos_pandemia_limpo != ""]
> textos_pandemia_limpo[1:10]
 [1] "you are so ignorant will is doing a lot of thing during these pandemia where is harry"
 [2] "sim a pandemia"                                                                       
 [3] "sim a pandemia"                                                                       
 [4] "sim a pandemia"                                                                       
 [5] "sim a pandemia"                                                                       
 [6] "sim a pandemia"                                                                       
 [7] "sim a pandemia"                                                                       
 [8] "sim a pandemia"                                                                       
 [9] "sim a pandemia"                                                                       
[10] "sim a pandemia"                                                                       
> class(textos_pandemia_limpo)
[1] "character"
> 
> textos_covid_limpo = textos_covid
> textos_covid_limpo <- limpaTweets(textos_covid_limpo)
> names(textos_covid_limpo) = NULL
> textos_covid_limpo = textos_covid_limpo[textos_covid_limpo != ""]
> textos_covid_limpo[1:10]
 [1] NA                                                                                                                            
 [2] "why do you think the cdc and the democrat politicians don t want you taking hydroxychloroquine if you get covid"             
 [3] NA                                                                                                                            
 [4] "covid almost made me forget about the real virus in america"                                                                 
 [5] "the dems blasted trump's cdc crony on the pitiful report that was mandated by congress on the effects of covid on min..."    
 [6] "bloomberg has the audacity to quote cuomo comparing this to multiple sold out yankee stadiums when his order to sen..."      
 [7] NA                                                                                                                            
 [8] "the usa s share of the world s population \n\n \n\nthe usa s share of the world s covid deaths \n\n \n\nthis is what inco..."
 [9] "our job is to help our clients figure out where they fit in all of this\" siempre tirando la precisa"                        
[10] "census is crucial to rebuilding from covid"  

Some data is lost during the cleaning process, but this is expected.

Note that the cleaning function has removed the following:

http/https links, retweet markers, #hashtags, usernames ("@people"), punctuation, numbers, unnecessary spaces, and capital letters.

Next, we convert the cleaned text into a corpus, a structure for text manipulation that allows another round of cleaning to be applied to the corpus itself rather than to the previous dataset. We then convert the corpus into a term-document matrix to identify the frequency of the terms, which enables visualization through graphs.
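The tweetcorpus_pandemia and tweetcorpus_covid objects used in the next block come from this conversion step. A minimal sketch of how they can be created with the tm package is shown below; additional cleaning (for example, removing English stop words) can be applied either at the corpus stage with tm_map() or through the control argument of TermDocumentMatrix():

# Converting the cleaned text vectors into tm corpora
library(tm)
tweetcorpus_pandemia <- Corpus(VectorSource(textos_pandemia_limpo))
tweetcorpus_covid    <- Corpus(VectorSource(textos_covid_limpo))
# Optional extra cleaning on the corpus itself, for example:
# tweetcorpus_pandemia <- tm_map(tweetcorpus_pandemia, removeWords, stopwords("english"))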

> #Converts text to the term matrix
> termo_por_documento_pandemia     = as.matrix(TermDocumentMatrix(tweetcorpus_pandemia), control = list(stopwords = c(stopwords("english"))))
> termo_por_documento_covid = as.matrix(TermDocumentMatrix(tweetcorpus_covid), control = list(stopwords = c(stopwords("english"))))
> 
> #Checks the first 10 terms (rows) with the first 10 documents (columns)
> termo_por_documento_pandemia[1:10,1:10]
          Docs
Terms      1 2 3 4 5 6 7 8 9 10
  are      1 0 0 0 0 0 0 0 0  0
  doing    1 0 0 0 0 0 0 0 0  0
  during   1 0 0 0 0 0 0 0 0  0
  harry    1 0 0 0 0 0 0 0 0  0
  ignorant 1 0 0 0 0 0 0 0 0  0
  lot      1 0 0 0 0 0 0 0 0  0
  pandemia 1 1 1 1 1 1 1 1 1  1
  these    1 0 0 0 0 0 0 0 0  0
  thing    1 0 0 0 0 0 0 0 0  0
  where    1 0 0 0 0 0 0 0 0  0
> termo_por_documento_covid[1:10,1:10]
                    Docs
Terms                1 2 3 4 5 6 7 8 9 10
  and                0 1 0 0 0 0 0 0 0  0
  cdc                0 1 0 0 1 0 0 0 0  0
  covid              0 1 0 1 1 0 0 1 0  1
  democrat           0 1 0 0 0 0 0 0 0  0
  don                0 1 0 0 0 0 0 0 0  0
  get                0 1 0 0 0 0 0 0 0  0
  hydroxychloroquine 0 1 0 0 0 0 0 0 0  0
  politicians        0 1 0 0 0 0 0 0 0  0
  taking             0 1 0 0 0 0 0 0 0  0
  the                0 2 0 1 3 1 0 4 0  0
> # Calculates the frequency of each term by summing each row and sorts in descending order
> frequencia_dos_termos_pandemia = sort(rowSums(termo_por_documento_pandemia), decreasing = TRUE) 
> head(frequencia_dos_termos_pandemia)
pandemia      sim      and   during    river   munnee 
      98       97        3        2        2        2 
> frequencia_dos_termos_covid = sort(rowSums(termo_por_documento_covid), decreasing = TRUE) 
> head(frequencia_dos_termos_covid)
  the covid   and  from  this   for 
   91    64    29    17    15    15 

Creating the graphs


#Creates a dataframe with the term (word) and its frequency
df_pandemia = data.frame(termo = names(frequencia_dos_termos_pandemia), frequencia = frequencia_dos_termos_pandemia) 
df_covid = data.frame(termo = names(frequencia_dos_termos_covid), frequencia = frequencia_dos_termos_covid) 

#Removes the most frequent term
df_pandemia = df_pandemia[-1,]
class(df_pandemia)
df_covid = df_covid[-1,]
class(df_covid)
> # Draw the word cloud (uses the wordcloud and RColorBrewer packages)
> wordcloud(df_pandemia$termo, 
+           df_pandemia$frequencia, 
+           max.words = 100,
+           min.freq = 2,
+           scale = c(3,.5),
+           random.order = FALSE, 
+           colors = brewer.pal(8, "Dark2"))
> wordcloud(df_covid$termo, 
+           df_covid$frequencia, 
+           max.words = 100,
+           min.freq = 3,
+           scale = c(3,.5),
+           random.order = FALSE, 
+           colors = brewer.pal(8, "Dark2"))

The purpose of our analysis was to identify the words that revolve around the search terms, so we removed the most frequent words, which were the search terms themselves. We can also merge the term-frequency data frames from two searches to compare how often each term appears in both; the example below merges the frequency data frames from two other searches and adds a total-frequency column:

> # Merge of dataframes
> df_merge <- merge(df_bigdata, df_datascience, by = "termo")
> head(df_merge)
     termo frequencia.x frequencia.y
1      amp            2            5
2      and            1           25
3      are            1           14
4    blood            1            2
5 citizens            1            1
6    covid            1           57
> df_merge$freq_total <- df_merge$frequencia.x + df_merge$frequencia.y 
> head(df_merge)
     termo frequencia.x frequencia.y freq_total
1      amp            2            5          7
2      and            1           25         26
3      are            1           14         15
4    blood            1            2          3
5 citizens            1            1          2
6    covid            1           57         58

Twitterbot


Another important tool is the Twitterbot, which publishes tweets automatically. It allows companies to perform web scraping and post important information to their Twitter channel, keeping their followers informed.

For example, a company could collect stock exchange or sales data and automatically share it in the news feed of its Twitter channel.

> # Web Scraping (uses the rvest and stringr packages)
> library(rvest)
> library(stringr)
> url <- "https://cran.r-project.org/web/packages"
> # Reading the url
> page <- read_html(url)
> # Getting the number of packages
> n_packages <- page %>%
+   html_text() %>% 
+   str_extract("[[:digit:]]* available packages") %>% 
+   str_extract("[[:digit:]]*") %>% 
+   as.numeric()
> print(n_packages)
[1] 15713
> # Authenticating on Twitter
> setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
[1] "Using direct authentication"
> # Time
> time <- Sys.time()
> # Create the tweet
> tweet_text <- paste0("Hello everybody, today is ", time, " and right now there are ", n_packages, " R packages on CRAN. Here is TweetBot LEXX Consulting in action!")
> # Send the tweet
> tweet(tweet_text)
[1] "ZRowsey: Hello everybody, today is 2020-05-27 19:06:46 and right now there are 15713 R packages on CRAN. Here is TweetBot LEXX Consulting in action!"

As we can see in the figure below, the message was posted automatically.
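The bot above posts a single tweet each time the script is run. To keep the company's feed updated automatically, the same steps can simply be repeated on a schedule. The loop below is a minimal base-R sketch of that idea (in practice a system scheduler such as cron, or a package like taskscheduleR on Windows, is a more robust choice):

# Illustrative only: re-scrape and post once per hour
repeat {
  page <- read_html(url)
  n_packages <- page %>%
    html_text() %>%
    str_extract("[[:digit:]]* available packages") %>%
    str_extract("[[:digit:]]*") %>%
    as.numeric()
  tweet(paste0("Hello everybody, today is ", Sys.time(),
               " and right now there are ", n_packages, " R packages on CRAN."))
  Sys.sleep(60 * 60)  # wait one hour before the next post
}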



Using Twitter Offline


Before any machine learning application, it is important to assess how the company's Twitter account is being managed, so it is worth checking how the tweets behave with respect to questions such as:

Is there any difference in the performance of posts by time of day or by day of the week?

How does the number of characters in a tweet impact the user's final perception?

Do the tweets use hashtags and receive interactions and replies?

These are some examples of questions that should be considered before applying machine learning, in order to understand whether or not the company is using the tool properly.

As there are limitations on data capture under Twitter's policy, there is another tool that allows access to the company's entire tweet history in CSV format, so it is possible to work with an offline database to perform statistical analyses and thus evaluate the management of the Twitter account.

As we can see in the figure below, the entire history of the company's Twitter account can be exported to CSV format.


Based on this descriptive time series analysis and statistical hypothesis tests such as chisq.test (the chi-square test), we can assess whether the hypotheses raised during the descriptive analysis should be rejected or not. This makes it possible to answer with more accuracy how effectively the account is managed, identifying strengths, weaknesses, and insights for improving social media management, including defining goals and creating a real-time monitoring dashboard.

These statistical tests are not mandatory, but they are considered good data analysis and modeling practice for making more accurate inferences.
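As an illustration, suppose the exported history has one row per tweet, a column with the day of the week, and a column with the number of engagements (the file and column names below are hypothetical). A sketch of the chi-square test on a weekday-versus-engagement contingency table could look like this:

# Hypothetical example: does engagement depend on the day of the week?
historico <- read.csv("tweets_history.csv")   # assumed export file name

# Classify each tweet as high or low engagement relative to the median
historico$nivel <- ifelse(historico$engagements > median(historico$engagements),
                          "high", "low")

# Contingency table: day of the week x engagement level
tabela <- table(historico$weekday, historico$nivel)
tabela

# Chi-square test of independence
chisq.test(tabela)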

We can also compare tweet management performance with other company data, such as sales and marketing investments, and thus more accurately identify possible correlations with other variables.

In this way, we can work with two variables from different data sources: one can be the company's sales history and the other the history of the company's number of tweets, and then assess whether there is a correlation between them.

If this correlation is confirmed, the tweet variable can be included in the dataset used to make sales forecasts with machine learning; it would then be possible to estimate an increase in sales under the hypothesis of an increase in the number of tweets.
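A sketch of this idea, assuming the two series have already been aggregated by month into a small data frame (the object and column names, and the values, are illustrative only):

# Hypothetical monthly data: tweet volume and sales
historico_mensal <- data.frame(
  n_tweets = c(40, 55, 38, 70, 65, 80),
  vendas   = c(120, 150, 115, 190, 170, 210)
)

# Pearson correlation with significance test
cor.test(historico_mensal$n_tweets, historico_mensal$vendas)

# If the correlation is confirmed, n_tweets can be kept as a feature in the
# dataset later used for sales forecasting with machine learning.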

Another example is identifying a correlation between the number of tweets made, customer satisfaction, and the volume of customer service requests.

Therefore, there are countless business possibilities that can be carried out with an in-depth analysis of the company's social media.

We will not go deeper into descriptive statistics here, as this approach is covered in other articles; our goal is simply to present some possibilities for mining text data from Twitter.


Applying Machine Learning to our data


Other articles cover all the steps of the machine learning process, so here we present only a very generic application of machine learning, to better understand the usefulness of the whole process of collecting data from social networks.

From the database generated from social media, we select only the columns we consider most interesting from the perspective of a business problem. With this dataset in hand, we start the k-means unsupervised learning process, in which the algorithm automatically detects patterns and groups the observations into clusters. We have an entire article on our blog dedicated to the unsupervised machine learning process, should you have questions during this generic approach.

We will use the fpc package, which helps identify the best number of clusters for the model. The cluster analysis does not work with categorical variables, so we transform our selected dataset before starting, replacing categorical variables with integer values, following the flow of the script below.

library(fpc)
head(alldata) # the final dataset originated from the social media
data <- alldata
# Extracting some columns
cdata <- data.frame(data$type, data$comments_count, data$likes_count, data$user_id)
# Naming the columns
colnames(cdata) <- c("type", "comments", "likes", "user_id")
# Converting to integer (some clustering algorithms only accept numbers as input)
cdata$user_id <- as.integer(cdata$user_id)
cdata$type <- as.integer(cdata$type)
# Viewing the data
View(data)
View(cdata)
# Estimating the number of clusters (automatic)
> clusters <- pamk(cdata)
> n <- clusters$nc
> n
[1] 2

The automatic estimate was only 2 clusters. As this is less than 4, which is the generally recommended number, we will run a manual test that calculates the within-group sum of squared errors and then compare the results before the final definition of the number of clusters for the modeling.

# Estimating the number of clusters (manual)
# Calculating the sum of square errors
wss <- (nrow(cdata) - 1) * sum(apply(cdata, 2, var))
# Seeking the sum of square errors within groups
for (i in 2:25) wss[i] <- sum(kmeans(cdata, centers = i)$withinss)
# Plot of clusters
# Logically, as the number of clusters increases, the sum of the square errors reduces.
# If there are n objects in a data set, then n clusters would result in error 0, but ideally
# we need to stop at some point. According to theory, the rate of decrease in the sum of errors
# falls sharply at one point, and that point should be considered the ideal number of clusters.
# According to the graph below, the ideal cluster number is 4.
plot(1:25, wss, type = "b", xlab = "Number of clusters", ylab = "Sum of squares within groups")

Good practice says to stop where the error curve stops falling sharply, which indicates the ideal number of clusters. From the manual calculation we see that only 2 clusters would leave more error, so we choose 4 clusters, since above 4 there is little further gain in error reduction.

Next, we will use the unsupervised K-means machine learning algorithm with the number of clusters equal to 4, as in the code below:


> fit <- kmeans(cdata, 4)
> fit$cluster
   [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1
  [49] 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
  [97] 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3
 [145] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [193] 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3
 [241] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
 [289] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [337] 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [385] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [433] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [481] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [529] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [577] 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3
 [625] 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [673] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [721] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [769] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [817] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [865] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [913] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [961] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [ reached getOption("max.print") -- omitted 2570 entries ]


> # Number of elements in each cluster
> table(fit$cluster)
   1    2    3    4 
 781  419  570 1800 
> # Cluster averages
> aggregate(cdata, by = list(fit$cluster), FUN = mean)
  Group.1     type   comments       likes    user_id
1       1 1.051216 1835.98592 107161.0871  456825599
2       2 1.016706   10.93795    402.0286 1907204735
3       3 1.026316  522.40000  23406.6228 1409417548
4       4 1.076111 1148.87000 105233.9922   84176799
> # Observations in each cluster
> resultado <- cbind(cdata, clusterNum = fit$cluster)

> library(fpc)
> plotcluster(cdata, fit$cluster)
> dev.copy(png, filename = "clusterPlot.png", width = 600, height = 875);
png 
  4 
> dev.off ();
RStudioGD 
        2 

The figure below shows the new dataset, on which we can begin new analyses considering the 4 groups.


With this classification into 4 groups, we could start an investigation into the behavioral similarities of the users in each group, trying to identify whether they belong to the same region, have similar numbers of followers, are artists, and so on, and then seek insights to support decision makers.

There are countless possibilities for generating important information from data, and our mission in this article is to present this horizon of techniques that can enhance your business, as these tools are modern and widely used by leading, innovative companies.


Recommendation systems


Another important activity within the social media analysis process is generating recommendations based on the observed patterns. With the historical database properly organized, we can start the process of making recommendations to followers according to the profiles they already follow.

Below is the script:

library(data.table)

# Loading the dataset with followers
userfollows <- read.csv("dataset4-userfollows.csv")
head(userfollows)
names(userfollows)

# The dataset has several variables, but to build the recommendation engine and
# provide recommendations to users we only need two columns. We select these two
# columns using the data.frame function:
fdata <- data.frame(userfollows$users.i..1., userfollows$username)
colnames(fdata) <- c("user", "follows")
head(fdata)

# Data pivot
# Now we have to reshape the dataset so that users become the columns and
# the users they follow become the rows. This makes it easy to calculate the similarity between users.
# To pivot the data, we use the dcast.data.table function from the data.table package.
pivoting <- data.table(fdata)
pivotdata <- dcast.data.table(pivoting, follows ~ user, fun.aggregate = length, value.var = "user")
write.csv(pivotdata, "dataset5-pivot-follows-temp.csv")

# After deleting the index column and the null user

# Reading the data
data <- read.csv("dataset5-pivot-follows.csv")
head(data)
colnames(data)

# Removing the user column
data.ubs <- data[, !(names(data) %in% c("users"))]

# Function that calculates the similarity between 2 vectors
# We can calculate the similarity of users using different methods.
# In our case, we use the cosine similarity technique to obtain the similarity score
# for all pairs of users. In our dataset, zero means the user is not following.
# If we include those all-zero rows while calculating similarity using correlation or any
# other technique, we end up with a biased output that is far from reality.
# Thus, when calculating the similarity score, we consider only non-zero rows.
# The following function calculates the similarity between users using the cosine similarity method.
getCosine <- function(x, y)
{
  dat <- cbind(x, y)
  f <- as.data.frame(dat)
  # Remove rows with zeros
  datn <- f[-which(rowSums(f == 0) > 0), ]
  if (nrow(datn) > 2)
  {
    this.cosine <- sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
  }
  else
  {
    this.cosine <- 0
  }
  return(this.cosine)
}

# Now we need to build a similarity matrix that tells us how similar users are
# to each other. Before computing the similarity, we build an empty matrix that can
# be used to store the similarity:
data.ubs.similarity <- matrix(NA,
                              nrow = ncol(data.ubs),
                              ncol = ncol(data.ubs),
                              dimnames = list(colnames(data.ubs),
                                              colnames(data.ubs)))

# Applying the similarity calculation to all columns
# Now we can start replacing the empty cells in the similarity matrix with the real
# similarity scores. In the case of cosine similarity, the range is from -1 to +1.
# The following loop calculates the similarity between all users.
# If there is not enough data to calculate the similarity according to our function,
# it returns zero. The print statement in the loop helps us follow its progress.
# Depending on the dataset, the time required varies.
for (i in 1:ncol(data.ubs)) {
  # Loop over all columns
  for (j in 1:ncol(data.ubs)) {
    # Calculates similarity and fills the dataframe
    data.ubs.similarity[i, j] <- getCosine(as.matrix(data.ubs[i]), as.matrix(data.ubs[j]))
  }
  print(i)
}

# Convert the similarity matrix to a dataframe
data.ubs.similarity <- as.data.frame(data.ubs.similarity)

# Replace NA with 0
data.ubs.similarity[is.na(data.ubs.similarity)] <- 0
head(data.ubs.similarity)

# Getting each user's 10 nearest neighbors
data.neighbors <- matrix(NA,
                         nrow = ncol(data.ubs.similarity),
                         ncol = 11,
                         dimnames = list(colnames(data.ubs.similarity)))

# Generating recommendations
for (i in 1:ncol(data.ubs))
{
  # Avoiding zero values
  n <- length(data.ubs.similarity[, i])
  thres <- sort(data.ubs.similarity[, i], partial = n - 10)[n - 10]
  if (thres > 0.020)
  {
    # Selecting the Top 10 recommendations
    data.neighbors[i, ] <- t(head(n = 11, rownames(data.ubs.similarity[order(data.ubs.similarity[, i], decreasing = TRUE), ][i])))
  }
  else
  {
    data.neighbors[i, ] <- ""
  }
}

# Viewing recommendations
# In the code above, we take one user at a time and rank all other users by their
# similarity score with that user, so the most similar pair comes first.
# We then keep only the top 10 for each user. These are our recommendations.
# We can see the recommendations given to users:
View(data.neighbors)

# Saving the recommendations
write.csv(data.neighbors, "Recomendations.csv")

See the dataset in the figure below and note that the recommendation system recommended the artist John Legend to the followers of Alicia Keys.

This idea can be replicated for any business problem, generating recommendations for different products and services, since the underlying data science is the same.


Conclusion


In summary, we saw that social network analytics can use Twitter to perform word searches in order to find insights and communicate more effectively with customers, leveraging marketing campaigns in a much more personalized way. In addition, it allows posts to the company's Twitter account to be automated and recommendations to be sent to customers.


Benefits


  • Miscellaneous market research

  • Support for decision makers

  • Marketing actions more targeted to the target audience

  • Higher return on marketing investment

  • Attracting new potential customers

  • Better customer relationship

  • More customized marketing to consumers

  • New customers


