This is a very simple example of the sort of workflow involved in text processing. As an example, I have used the text from the page below, which I found by googling "nature and bereavement".
http://journeyofhearts.org/healing/nature.html
The page has been cut and pasted into a text file.
library(wordcloud)
library(dplyr)
library(tidyr)
library(scales)
library(stringr)
library(readr)
library(tidytext)
options(scipen=999)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
The data from a simple text file can be read in using read_lines. Blank lines are then filtered out and the factor coerced into a character vector.
d<-data.frame(text=read_lines("nature_healing.txt"))
d %>% filter(text != "") %>% mutate(text=as.character(text))->d
DT::datatable(d)
The tidytext package has functions to extract the words (tokens) and to remove stop words.
library(SnowballC)
data(stop_words)
d %>% unnest_tokens(word, text) %>% anti_join(stop_words) -> words
### Count the frequencies
words %>%
group_by(word) %>%
count() %>% arrange(-n) ->word_count
## Show as table
DT::datatable(word_count)
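As a small aside, the group, count and sort steps can be collapsed into a single call to dplyr's count() with sort = TRUE. The sketch below uses a made-up six-word vector (toy_words is a hypothetical name, not part of the original analysis) so that it runs without the text file.

```r
library(dplyr)

# Made-up tokens standing in for the output of unnest_tokens()
toy_words <- data.frame(
  word = c("tree", "loss", "tree", "garden", "tree", "loss"),
  stringsAsFactors = FALSE
)

# count() groups by word, tallies the rows and, with sort = TRUE,
# orders the result from most to least frequent
toy_count <- toy_words %>% count(word, sort = TRUE)
toy_count
```

The result has the same two columns (word and n) as word_count above, so it could be passed to wordcloud() in the same way.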
Making a word cloud from the table of frequencies is easy using the wordcloud package.
wordcloud(word_count$word, word_count$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
Sentiment analysis is the part that may not work quite as well as you might hope. There are various lexicons available through the R package tidytext. These lexicons are simple tables of words with associated emotions or scores. Some words have several emotions associated with them.
# Get Lexicon
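# get_sentiments() returns the NRC lexicon directly
# (recent versions of tidytext fetch it through the textdata package)
nrc <- get_sentiments("nrc") %>%
dplyr::select(word, sentiment)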
## Join to words
words %>% inner_join(nrc, by = "word") -> word_sentiment
DT::datatable(word_sentiment)
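Because a word can carry several emotions, the join returns one row per word and sentiment pair, and words missing from the lexicon disappear altogether. The sketch below illustrates this with a toy lexicon whose entries are invented for the example (they are not the real NRC entries); toy_lexicon, toy_tokens and toy_joined are hypothetical names.

```r
library(dplyr)

# Toy lexicon, invented for illustration only (not the real NRC entries)
toy_lexicon <- data.frame(
  word      = c("loss", "loss", "loss", "garden", "garden"),
  sentiment = c("sadness", "negative", "fear", "joy", "positive"),
  stringsAsFactors = FALSE
)

# Three tokens as they might come out of unnest_tokens()
toy_tokens <- data.frame(word = c("loss", "garden", "tree"),
                         stringsAsFactors = FALSE)

# inner_join() keeps one row per matching word/sentiment pair
# and drops any word that is not in the lexicon
toy_joined <- inner_join(toy_tokens, toy_lexicon, by = "word")

nrow(toy_tokens)   # number of tokens that went in
nrow(toy_joined)   # number of word/sentiment rows that came out
```

So the sentiment totals plotted below count word and sentiment pairs rather than words, and say nothing about the words the lexicon does not cover.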
library(ggplot2)
word_sentiment %>% group_by(sentiment) %>% summarise(n=n()) %>% arrange(n) %>% mutate(sentiment = reorder(sentiment, n)) ->ws
ggplot(data=ws,aes(x=sentiment,y=n,label=n)) + geom_bar(stat="identity") + geom_label() + coord_flip()
word_sentiment %>% group_by(sentiment,word) %>% count() %>% filter(n>3) %>% arrange(-n) %>% ungroup() %>% mutate(word = reorder(word, n)) %>% ggplot(aes(x=word,y=n)) +geom_bar(stat="identity",fill="red") +facet_wrap(~ sentiment, scales = "free", ncol = 5) +coord_flip()
Shorter pieces of text can be scored for their use of positive and negative terms, and the score can then be used as a response variable in a statistical analysis. However, be aware of the potential weaknesses of the lexicon used and of the potential for mis-scoring, especially when terms are negated or taken out of context.
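The negation problem can be sketched with a toy example. The scores and the list of negating words below are invented for illustration and are not taken from any published lexicon. Scoring word by word treats "not good" as positive, whereas splitting the text into bigrams allows the sign to be flipped when the preceding word is a negator. This is only one possible workaround, not the method used in the rest of this page.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy scores, invented for illustration (not from any published lexicon)
toy_scores <- data.frame(word = c("good", "happy"), score = c(3, 3),
                         stringsAsFactors = FALSE)
negators <- c("not", "no", "never")   # hypothetical, deliberately short list

txt <- data.frame(id = 1, text = "this is not good", stringsAsFactors = FALSE)

# Word-by-word scoring only sees "good" and reports a positive score
naive <- txt %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_scores, by = "word") %>%
  summarise(score = sum(score))

# Bigram scoring flips the sign when the preceding word is a negator
adjusted <- txt %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  inner_join(toy_scores, by = c("word2" = "word")) %>%
  mutate(score = ifelse(word1 %in% negators, -score, score)) %>%
  summarise(score = sum(score))

naive$score
adjusted$score
```

Note that this sketch only scores the second word of each bigram, so a scored word at the very start of a text would be missed. It also does nothing about sarcasm or words used out of context.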
The word tokens are extracted from a data frame along with the covariates in the rows that contain text. So, if shorter pieces of text are scored numerically, the relationship between the scores and the other covariates can be examined. For example, here is the text of some of Donald Trump’s recent tweets along with the number of times his followers have favourited them. Each tweet can be scored as being positive or negative.
d<-read_csv("/home/aqm/data/trump_recent_tweets.csv")
d$text<-as.character(d$text)
d %>% select(id,text,favoriteCount,created, hour) ->d
DT::datatable(d)
data(stop_words)
d$text2<-d$text
d %>% select(hour, text2, id, favoriteCount,text) %>% unnest_tokens(word, text) %>% anti_join(stop_words) -> words
This time the AFINN lexicon is used. It gives each word an integer score between -5 (most negative) and +5 (most positive).
afin<-get_sentiments("afinn")
words %>% inner_join(afin, by = "word") -> word_score
# Note: recent versions of tidytext name the AFINN score column `value` rather than `score`
word_score %>% group_by(id,text2) %>% summarise(n=n(),score=mean(score),favourited=mean(favoriteCount)) %>% ggplot(aes(x=score,y=favourited, label =text2)) + geom_point() + geom_smooth() ->g1
library(plotly)
ggplotly(g1)
Do more positive tweets attract more favourites? The answer seems to be no. Donald Trump’s Twitter followers don’t seem to care much whether he expresses positive or negative sentiments. Because the results are plotted with plotly, at least some of the text of each tweet can be seen by hovering over its point. Trump received the most favourites when he tweeted “Merry Christmas”, which scored quite highly on the positivity index!
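A visual impression like this can be backed up with a rank correlation between the mean score and the favourite count. The sketch below uses a made-up six-row table (tweet_summary is a hypothetical stand-in for the per-tweet summary, and its numbers are not real results) simply to show the call; Spearman's method is assumed here because favourite counts tend to be heavily skewed.

```r
# Made-up per-tweet summary standing in for the real data;
# these numbers are illustrative only and are not results
tweet_summary <- data.frame(
  score      = c(-2, -1, 0, 1, 2, 3),
  favourited = c(52000, 48000, 61000, 45000, 58000, 50000)
)

# Spearman's rank correlation makes no assumption of linearity
# and is less affected by a few very popular tweets
ct <- cor.test(tweet_summary$score, tweet_summary$favourited,
               method = "spearman")

ct$estimate   # rank correlation coefficient (rho)
ct$p.value    # p-value for the null hypothesis of no association
```

With the real data, the same call would be applied to the score and favourited columns produced by the summarise() step above.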
The udpipe package has some powerful features for breaking text down into parts of speech.
There are some ideas in this tutorial.
https://towardsdatascience.com/easy-text-analysis-on-abc-news-headlines-b434e6e3b5b8
library(udpipe)
#model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = 'english-ewt-ud-2.3-181115.udpipe')
dd<-udpipe_annotate(udmodel_english, words$word)
dd<-data.frame(dd)
dd %>% group_by(upos,token) %>% filter(upos=="NOUN") %>% summarise(n=n()) %>% arrange(-n) -> nouns
wordcloud(nouns$token, nouns$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
dd %>% group_by(upos,token) %>% filter(upos=="VERB") %>% summarise(n=n()) %>% arrange(-n) %>% filter(token!="republicans")-> verbs
wordcloud(verbs$token, verbs$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
dd %>% group_by(upos,token) %>% filter(upos=="ADJ") %>% summarise(n=n()) %>% arrange(-n) %>% filter(token!="southern")-> adjectives
wordcloud(adjectives$token, adjectives$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)