Chapter 23 Simple text processing with sentiment analysis
23.1 Introduction
This chapter walks through a very simple example of the workflow involved in text processing. The example text comes from the page below, which was found by googling "nature and bereavement".
http://journeyofhearts.org/healing/nature.html
The page has been cut and pasted into a text file.
library(wordcloud)
library(dplyr)
library(tidyr)
library(scales)
library(stringr)
library(readr)
library(tidytext)
options(scipen=999)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
23.2 Reading in the data
The lines of a plain text file can be read in using read_lines. Blank lines are then filtered out and the text column is coerced into a character vector, because data.frame converts strings to factors by default in versions of R before 4.0.
d<-data.frame(text=read_lines("nature_healing.txt"))
d %>% filter(text != "") %>% mutate(text=as.character(text))->d
DT::datatable(d)
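As a minimal sketch of the same read-and-filter step that runs without the original file, the lines below write a few made-up lines to a temporary file and read them back. The file contents and the name `toy` are illustrative assumptions and are not part of the original data.

```r
library(readr)
library(dplyr)

# Write a few made-up lines, including blank ones, to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Nature and healing", "", "A walk in the woods", ""), tmp)

# Read one element per line, keep the text as character, drop blank lines
toy <- data.frame(text = read_lines(tmp), stringsAsFactors = FALSE) %>%
  filter(text != "")

print(toy)
```

Setting `stringsAsFactors = FALSE` keeps the column as character from the start, so no later coercion is needed.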
23.3 Making a data frame consisting of just words
The tidytext package has functions to extract the words (tokens) and to remove stop words.
library(SnowballC)
data(stop_words)
d %>% unnest_tokens(word, text) %>% anti_join(stop_words) -> words
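To see what the tokenising step does, here is a small sketch on two made-up sentences (the data frame `toy` is an assumption for illustration). By default unnest_tokens lower-cases the text and strips punctuation, and the anti-join then removes common words such as "the".

```r
library(dplyr)
library(tidytext)

# Two made-up sentences standing in for lines of the text file
toy <- data.frame(
  text = c("Nature heals the grieving heart.",
           "The garden is a place of peace."),
  stringsAsFactors = FALSE
)

# One row per word, then drop the stop words
toy_words <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

print(toy_words)
```

Naming the join column with `by = "word"` avoids the message that anti_join otherwise prints about the column it has chosen.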
23.4 Count the frequency of each word
### Count the frequencies
words %>%
group_by(word) %>%
count() %>% arrange(-n) ->word_count
## Show as table
DT::datatable(word_count)
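The group, count and arrange steps can also be written as a single call to dplyr's count with `sort = TRUE`. The sketch below uses a made-up vector of words (an assumption for illustration) to show the shape of the result.

```r
library(dplyr)

# Made-up tokens standing in for the `words` data frame
toy_words <- data.frame(
  word = c("nature", "grief", "nature", "healing", "nature", "grief"),
  stringsAsFactors = FALSE
)

# Count each word and sort from most to least frequent
toy_count <- toy_words %>%
  count(word, sort = TRUE)

print(toy_count)
```

The result is an ungrouped table with one row per word and the frequency in a column named `n`.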
23.5 Word cloud
Making a word cloud from the table of frequencies is easy using the wordcloud package.
wordcloud(word_count$word, word_count$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
23.6 Find the sentiments associated with the words
This is the part that may not quite work as well as you might hope. There are various lexicons in the R package tidytext. These lexicons are simple tables with words and associated emotions, or scores. Some words can have several emotions associated with them.
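# Get Lexicon
# Recent versions of tidytext no longer include NRC in the `sentiments` table;
# get_sentiments("nrc") fetches it through the textdata package instead
nrc <- get_sentiments("nrc") %>%
dplyr::select(word, sentiment)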
## Join to words
words %>% inner_join(nrc, by = "word") -> word_sentiment
DT::datatable(word_sentiment)
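Because a word can carry several emotions, the join can return more rows than there are matched words, and words missing from the lexicon drop out altogether. The sketch below shows this with a tiny hand-made lexicon; `toy_lexicon` and its entries are invented for illustration and are not taken from NRC.

```r
library(dplyr)

# Invented lexicon: each word is listed once per associated sentiment
toy_lexicon <- data.frame(
  word = c("loss", "loss", "garden", "garden"),
  sentiment = c("sadness", "negative", "joy", "positive"),
  stringsAsFactors = FALSE
)

# Three tokens, one of which ("tree") is not in the lexicon
toy_words <- data.frame(word = c("loss", "garden", "tree"),
                        stringsAsFactors = FALSE)

# inner_join keeps only matched words, one row per word-sentiment pair
toy_joined <- inner_join(toy_words, toy_lexicon, by = "word")

print(toy_joined)
```

This is worth keeping in mind when reading the sentiment counts: they count word-sentiment pairs, not words.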
23.7 Plotting the frequencies of the sentiments
library(ggplot2)
word_sentiment %>%
group_by(sentiment) %>%
summarise(n=n()) %>%
arrange(n) %>%
mutate(sentiment = reorder(sentiment, n)) ->ws
ggplot(data=ws,aes(x=sentiment,y=n,label=n)) +
geom_bar(stat="identity") +
geom_label() +
coord_flip()
23.8 Words associated with each sentiment
word_sentiment %>%
group_by(sentiment,word) %>%
count() %>%
filter(n>3) %>%
arrange(-n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x=word,y=n)) +
geom_bar(stat="identity",fill="red") +
facet_wrap(~ sentiment, scales = "free", ncol = 5) +
coord_flip()
23.9 Used in a questionnaire context
Shorter pieces of text can be scored for their use of positive and negative terms, and the score can then be used as a response variable in a statistical analysis. However, be aware of the potential weaknesses of the lexicon used and of the potential for mis-scoring, especially when terms are negated or taken out of context.
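The negation problem is easy to reproduce. The sketch below scores two made-up questionnaire answers with the Bing lexicon, which ships with tidytext. The answers and the `id` column are assumptions for illustration. Word-by-word matching has no notion of "not", so the negated answer still picks up a positive match for "good".

```r
library(dplyr)
library(tidytext)

# Two made-up answers; the second negates the first
answers <- data.frame(
  id = 1:2,
  text = c("The walk was good", "The walk was not good"),
  stringsAsFactors = FALSE
)

bing <- get_sentiments("bing")

# Tokenise, match against the lexicon and count sentiments per answer
scored <- answers %>%
  unnest_tokens(word, text) %>%
  inner_join(bing, by = "word") %>%
  count(id, sentiment)

print(scored)
```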
23.10 Example: One line per tweet
The word tokens are extracted from a data frame whose rows hold covariates alongside the text, so each token keeps the covariates of the text it came from. If shorter pieces of text are scored numerically, the relationship between the scores and the other covariates can then be looked at. For example, here is the text of some of Donald Trump’s recent tweets along with the number of times his followers have favourited them. Each tweet can be scored as being positive or negative.
d<-read_csv("recent_tweets.csv")
d$text<-as.character(d$text)
d %>% select(id,text,favoriteCount,created, hour) ->d
DT::datatable(d)
23.11 Extracting the words
data(stop_words)
d$text2<-d$text
d %>% select(hour, text2, id, favoriteCount,text) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) -> words
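The point that the covariates travel with the tokens can be checked on a toy example. The two tweets and their counts below are made up for illustration; only the column names follow the data used in this section.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets with an id and a favourite count
toy_tweets <- data.frame(
  id = c(1, 2),
  favoriteCount = c(120, 45),
  text = c("Merry Christmas everyone", "Bad news today"),
  stringsAsFactors = FALSE
)

# One row per word; id and favoriteCount are repeated on every row
toy_tokens <- toy_tweets %>%
  unnest_tokens(word, text)

print(toy_tokens)
```

Because unnest_tokens drops the input text column by default, the section's code first copies it to `text2` so that the full tweet is still available for labelling the plot.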
23.12 Sentiment scores
This time the AFINN lexicon is used. It gives each word an integer score between -5 (most negative) and +5 (most positive).
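afin<-get_sentiments("afinn")
# Newer versions of the lexicon call the score column `value`;
# rename it so that the code that follows works with either version
if ("value" %in% names(afin)) afin <- dplyr::rename(afin, score = value)
words %>%
inner_join(afin, by = "word") -> word_score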
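How a per-tweet score comes out of the join can be sketched without downloading the lexicon. The table `toy_afinn` below is a hand-made stand-in whose values are illustrative assumptions, and the tokens are made up as well. Words with no entry in the lexicon, such as "christmas" here, simply do not contribute to the mean.

```r
library(dplyr)

# Hand-made stand-in for a scored lexicon (values are illustrative)
toy_afinn <- data.frame(
  word = c("merry", "great", "bad", "disaster"),
  score = c(3, 3, -3, -2),
  stringsAsFactors = FALSE
)

# Made-up tokens from two tweets
toy_tokens <- data.frame(
  id = c(1, 1, 2, 2, 2),
  word = c("merry", "christmas", "bad", "disaster", "great"),
  stringsAsFactors = FALSE
)

# Mean score of the matched words in each tweet
toy_scores <- toy_tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(id) %>%
  summarise(n = n(), score = mean(score))

print(toy_scores)
```

A tweet with no matched words at all disappears from the result, so the scored tweets can be a subset of the original ones.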
23.13 Is there a relationship between the scores and the number of times the tweet is favourited?
word_score %>%
group_by(id,text2) %>%
summarise(n=n(),score=mean(score),favourited=mean(favoriteCount)) %>%
ggplot(aes(x=score,y=favourited, label =text2)) +
geom_point() +
geom_smooth() ->g1
library(plotly)
ggplotly(g1)
The answer seems to be no. The plot shows no obvious relationship, which suggests that Donald Trump’s Twitter followers don’t care much whether he expresses positive or negative sentiments. Plotting the results with plotly means that at least some of the text can be seen by hovering over a tweet. Trump received the most favourites when he tweeted “Merry Christmas”, which scored quite highly on the positivity index!
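A smoother gives a visual impression only. One possible numerical companion, sketched here on made-up numbers rather than on the tweet data, is a rank correlation between the mean score and the favourite count. The data frame `toy_scores` is an assumption for illustration, so the value it produces says nothing about the real tweets.

```r
# Made-up per-tweet summaries: mean sentiment score and favourite count
toy_scores <- data.frame(
  score = c(-2, -1, 0, 1, 2, 3),
  favourited = c(900, 400, 1200, 300, 1000, 700)
)

# Spearman's rank correlation is less sensitive to a few very popular tweets
rho <- cor(toy_scores$score, toy_scores$favourited, method = "spearman")

print(rho)
```

A value near zero would be consistent with the flat smoother, although with heavily skewed counts it is still worth looking at the plot.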
23.14 Using udpipe
The udpipe package has some powerful features for breaking text down into parts of speech.
There are some ideas in this tutorial.
https://towardsdatascience.com/easy-text-analysis-on-abc-news-headlines-b434e6e3b5b8
library(udpipe)
#model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = 'english-ewt-ud-2.3-181115.udpipe')
dd<-udpipe_annotate(udmodel_english, words$word)
dd<-as.data.frame(dd)
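Part-of-speech taggers use the words around each token, so tags for isolated words may be less reliable than tags for words in whole sentences. A possible alternative, sketched below, is to annotate the full text and pass an identifier through `doc_id`. The two sentences are made up, and the sketch assumes the model file named above is in the working directory; if the package or the file is missing, it just prints a message.

```r
# Model file name as used above; assumed to be in the working directory
model_file <- "english-ewt-ud-2.3-181115.udpipe"

# Two made-up sentences standing in for full tweets
toy_text <- c("Republicans will win big", "We are building a great wall")

if (requireNamespace("udpipe", quietly = TRUE) && file.exists(model_file)) {
  model <- udpipe::udpipe_load_model(file = model_file)
  # Annotate whole sentences so the tagger sees each word in context
  annotated <- as.data.frame(
    udpipe::udpipe_annotate(model, x = toy_text, doc_id = c("t1", "t2"))
  )
  print(annotated[, c("doc_id", "token", "upos")])
} else {
  message("udpipe or the model file is not available; skipping annotation")
}
```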
23.15 Nouns
dd %>% group_by(upos,token) %>%
filter(upos=="NOUN") %>%
summarise(n=n()) %>%
arrange(-n) -> nouns
wordcloud(nouns$token, nouns$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
23.16 Verbs
dd %>% group_by(upos,token) %>%
filter(upos=="VERB") %>%
summarise(n=n()) %>%
arrange(-n) %>%
filter(token!="republicans")-> verbs # drop a token that was mis-tagged as a verb
wordcloud(verbs$token, verbs$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
23.17 Adjectives
dd %>% group_by(upos,token) %>%
filter(upos=="ADJ") %>%
summarise(n=n()) %>% arrange(-n) %>%
filter(token!="southern")-> adjectives
wordcloud(adjectives$token, adjectives$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
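The noun, verb and adjective chunks repeat the same filter-and-count step, so a small helper could remove the repetition. In the sketch below `count_pos` is a hypothetical name, and the toy data frame only imitates the `token` and `upos` columns of the udpipe output.

```r
library(dplyr)

# Hypothetical helper: count the tokens carrying a given part-of-speech tag
count_pos <- function(annotated, tag) {
  annotated %>%
    filter(upos == tag) %>%
    count(token, sort = TRUE)
}

# Made-up annotation table with the two columns the helper needs
toy_annotated <- data.frame(
  token = c("wall", "build", "great", "wall", "win"),
  upos = c("NOUN", "VERB", "ADJ", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

toy_nouns <- count_pos(toy_annotated, "NOUN")
print(toy_nouns)
```

The returned table has columns `token` and `n`, so it can be passed to wordcloud in the same way as the tables above.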