Chapter 23 Simple text processing with sentiment analysis

23.1 Introduction

This is a very simple example of the sort of workflow involved in text processing. As an example I have used the text from this page, obtained by googling "nature and bereavement":

http://journeyofhearts.org/healing/nature.html

The page has been cut and pasted into a text file.

library(wordcloud)
library(dplyr)
library(tidyr)
library(scales)
library(stringr)
library(readr)
library(tidytext)

options(scipen=999)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)

23.2 Reading in the data

The data from a simple text file can be read in using read_lines. Blank lines are then filtered out, and the text column (which data.frame converts to a factor under the old stringsAsFactors default) is coerced back into a character vector.

d <- data.frame(text = read_lines("nature_healing.txt"))

d %>%
  filter(text != "") %>%
  mutate(text = as.character(text)) -> d

DT::datatable(d)

23.3 Making a data frame consisting of just words

The tidytext package has functions to extract the words (tokens) and to remove stop words.

library(SnowballC)  # provides stemming functions such as wordStem
data(stop_words)

# One word (token) per row, with stop words removed
d %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) -> words
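
SnowballC is loaded above because it provides stemming functions. Stemming is not used in the rest of this chapter, but as a sketch, the tokens could be reduced to their stems like this (words_stemmed is just an illustrative name):

# Optional: reduce each word to its stem, e.g. "healing" and "heals" both become "heal"
words %>%
  mutate(stem = wordStem(word)) -> words_stemmed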

23.4 Count the frequency of each word

### Count the frequencies

words %>%
  group_by(word) %>%
  count() %>%
  arrange(-n) -> word_count

## Show as table
DT::datatable(word_count)

23.5 Word cloud

Making a word cloud from the table of frequencies is easy using the wordcloud package.

wordcloud(word_count$word, word_count$n,
          random.order = FALSE, max.words = 100,
          colors = brewer.pal(8, "Dark2"), use.r.layout = TRUE)

23.6 Find the sentiments associated with the words

This is the part that may not quite work as well as you might hope. There are various lexicons in the R package tidytext. These lexicons are simple tables with words and associated emotions, or scores. Some words can have several emotions associated with them.

# Get the NRC lexicon of words and their associated emotions
# (recent versions of tidytext provide this through get_sentiments, via the textdata package)
nrc <- get_sentiments("nrc") %>%
  dplyr::select(word, sentiment)

## Join to words
words %>% inner_join(nrc, by = "word") -> word_sentiment
DT::datatable(word_sentiment)
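
To see how common multiple emotions per word are, the lexicon itself can be counted:

# Words that carry more than one sentiment in the NRC lexicon
nrc %>%
  count(word) %>%
  filter(n > 1) %>%
  arrange(-n)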

23.7 Plotting the frequencies of the sentiments

library(ggplot2)

word_sentiment %>%
  group_by(sentiment) %>%
  summarise(n = n()) %>%
  arrange(n) %>%
  mutate(sentiment = reorder(sentiment, n)) -> ws

ggplot(data = ws, aes(x = sentiment, y = n, label = n)) +
  geom_col() +
  geom_label() +
  coord_flip()

23.8 Words associated with each sentiment

word_sentiment %>%
  group_by(sentiment, word) %>%
  count() %>%
  filter(n > 3) %>%
  arrange(-n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "red") +
  facet_wrap(~ sentiment, scales = "free", ncol = 5) +
  coord_flip()

23.9 Used in a questionnaire context

Shorter pieces of text can be scored for their usage of positive and negative terms, and the scores then used as a response variable in a statistical analysis. However, be aware of the potential weaknesses of the lexicon used and the potential for mis-scoring, especially when terms are negated or taken out of context.
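
As a minimal sketch of the idea, using a small made-up set of responses (the responses data frame and its columns are hypothetical):

# Hypothetical questionnaire data: one short free-text answer per respondent
responses <- data.frame(respondent = 1:3,
                        response = c("I feel calm and hopeful",
                                     "Everything seems hopeless and sad",
                                     "No strong feelings either way"),
                        stringsAsFactors = FALSE)

# Tokenise, attach AFINN scores and sum to one score per respondent
# (recent versions of the AFINN lexicon name the score column "value")
responses %>%
  unnest_tokens(word, response) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(respondent) %>%
  summarise(score = sum(value))

Note that respondents whose answers contain no lexicon words drop out of the result, which may itself be informative.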

23.10 Example: One line per tweet

The word tokens are extracted from a data frame along with the covariates in the rows that contain text. So if shorter pieces of text are scored numerically, the relationship between the scores and other covariates can be examined. For example, here is the text of some of Donald Trump’s recent tweets along with the number of times his followers have favourited them. Each tweet can be scored as being positive or negative.

d <- read_csv("recent_tweets.csv")

d$text <- as.character(d$text)
d %>% select(id, text, favoriteCount, created, hour) -> d
DT::datatable(d)

23.11 Extracting the words

data(stop_words)

# Keep a copy of the full tweet text, since unnest_tokens consumes the text column
d$text2 <- d$text
d %>%
  select(hour, text2, id, favoriteCount, text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) -> words

23.12 Sentiment scores

This time the AFINN lexicon is used. It scores each word on a scale from -5 (most negative) to +5 (most positive).

afinn <- get_sentiments("afinn")
# Note: recent versions of the lexicon name the score column "value" rather than "score"
words %>%
  inner_join(afinn, by = "word") -> word_score
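
A quick check on the distribution of the scores that were attached:

# Distribution of AFINN scores across the matched tweet words
word_score %>% count(score)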

23.13 Is there a relationship between the scores and the number of times the tweet is favourited?

word_score %>%
  group_by(id, text2) %>%
  summarise(n = n(),
            score = mean(score),
            favourited = mean(favoriteCount)) %>%
  ggplot(aes(x = score, y = favourited, label = text2)) +
  geom_point() +
  geom_smooth() -> g1

library(plotly)
ggplotly(g1)

The answer seems to be no. Donald Trump’s Twitter followers don’t care much whether he expresses positive or negative sentiments. By plotting the results with plotly, at least some of the text of each tweet can be seen by hovering over its point. Trump scored the most favourite counts when he tweeted “Merry Christmas”, which scores quite highly on the positivity index!
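
To put a rough number on this impression, a simple linear model could be fitted to the same per-tweet summary used in the plot (a sketch; tweet_scores is just an illustrative name):

# Recompute the per-tweet summary
word_score %>%
  group_by(id, text2) %>%
  summarise(score = mean(score),
            favourited = mean(favoriteCount)) -> tweet_scores

# Does the mean sentiment score predict the favourite count?
summary(lm(favourited ~ score, data = tweet_scores))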

23.14 Using udpipe

The udpipe package has some powerful features for breaking text down into parts of speech.

There are some ideas in this tutorial.

https://towardsdatascience.com/easy-text-analysis-on-abc-news-headlines-b434e6e3b5b8

library(udpipe)

# The English model only needs to be downloaded once
# model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = 'english-ewt-ud-2.3-181115.udpipe')

# Annotate the word tokens and convert the result to a data frame
dd <- udpipe_annotate(udmodel_english, words$word)
dd <- as.data.frame(dd)
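
Before pulling out individual parts of speech, it is worth looking at how the tokens were tagged overall:

# Frequency of each universal part-of-speech tag in the annotated tokens
table(dd$upos)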

23.15 Nouns

dd %>%
  filter(upos == "NOUN") %>%
  group_by(upos, token) %>%
  summarise(n = n()) %>%
  arrange(-n) -> nouns

wordcloud(nouns$token, nouns$n,
          random.order = FALSE, max.words = 100,
          colors = brewer.pal(8, "Dark2"), use.r.layout = TRUE)

23.16 Verbs

dd %>%
  filter(upos == "VERB") %>%
  group_by(upos, token) %>%
  summarise(n = n()) %>%
  arrange(-n) %>%
  filter(token != "republicans") -> verbs  # drop a token apparently mis-tagged as a verb

wordcloud(verbs$token, verbs$n,
          random.order = FALSE, max.words = 100,
          colors = brewer.pal(8, "Dark2"), use.r.layout = TRUE)

23.17 Adjectives

dd %>%
  filter(upos == "ADJ") %>%
  group_by(upos, token) %>%
  summarise(n = n()) %>%
  arrange(-n) %>%
  filter(token != "southern") -> adjectives  # drop a token apparently mis-tagged as an adjective

wordcloud(adjectives$token, adjectives$n,
          random.order = FALSE, max.words = 100,
          colors = brewer.pal(8, "Dark2"), use.r.layout = TRUE)