In August 2016 a data analyst made the news by showing that Donald Trump’s Twitter account was being used by at least two separate people. The story made the Washington Post. http://varianceexplained.org/r/trump-tweets/
Today, Donald Trump tweeted some complimentary comments about women (on International Women’s Day). http://www.politico.com/story/2017/03/trump-womens-day-tweet-235814 The problem is that the nice comments all came from an iPhone. David Robinson’s analysis showed that Trump himself uses an Android device.
David Robinson has claimed that he is no longer analysing Trump’s tweets due to the adverse reaction he provoked. However, no one is going to notice my reanalysis, so I pulled down his code, tweaked it a little, and reran it to see if Trump and his team have swapped devices.
library(readr)
trump_tweets_df<-read_csv("trump_tweets.csv")
I pulled down the maximum of 3200 tweets, but decided to only use those written this year. That includes a few pre-inauguration tweets for good measure, when Trump may have been feeling more optimistic about things to come.
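The year filter itself is not shown in the code here. A minimal sketch of one way to do it, assuming `created` is parsed as a date-time and that "this year" means 2017; `toy_tweets` is a made-up stand-in for `trump_tweets_df`, not the real data:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for trump_tweets_df (the real data comes from trump_tweets.csv)
toy_tweets <- data.frame(
  id = c("1", "2", "3"),
  created = ymd_hms(c("2016-12-30 09:00:00",
                      "2017-01-02 08:15:00",
                      "2017-03-08 11:13:00")),
  stringsAsFactors = FALSE
)

# Keep only tweets created in 2017 (assumed here to be "this year")
tweets_2017 <- toy_tweets %>%
  filter(year(created) == 2017)
```

The same `filter()` line could be applied to `trump_tweets_df` before the device is extracted.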
library(dplyr)
library(tidyr)
tweets <- trump_tweets_df %>%
select(id, statusSource, text, created) %>%
extract(statusSource, "source", "Twitter for (.*?)<") %>%
filter(source %in% c("iPhone", "Android"))
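To see what the `extract()` call above does, here is a small sketch on made-up `statusSource` values. The HTML strings are illustrative assumptions about what the column looks like, not rows from the real data:

```r
library(tidyr)

# Hypothetical statusSource values; the real column holds similar HTML anchors
toy <- data.frame(
  statusSource = c(
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'
  ),
  stringsAsFactors = FALSE
)

# The capture group grabs whatever sits between "Twitter for " and the next "<"
devices <- extract(toy, statusSource, "source", "Twitter for (.*?)<")
devices$source
```

Rows that do not match the pattern come back as `NA`, which is why the later `filter(source %in% c("iPhone", "Android"))` drops them.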
The key indicator that seemed to confirm that two devices were being used was the time of day at which the tweets were sent. During the campaign, angry tweets were sent from an Android device early in the day. More measured tweets came from an iPhone later on. Is this still happening?
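library(lubridate)
library(scales)
library(ggplot2)
tweets %>%
count(source, hour = hour(with_tz(created, "EST"))) %>%
# share of each device's own tweets sent in each hour
group_by(source) %>%
mutate(percent = n / sum(n)) %>%
ungroup() %>%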
ggplot(aes(hour, percent, color = source)) +
geom_line() +
scale_y_continuous(labels = percent_format()) +
labs(x = "Hour of day (EST)",
y = "% of tweets",
color = "")
It certainly appears so!
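The percentage plotted above is meant to be each device's own distribution over the day. A minimal sketch on made-up data, grouping by `source` explicitly so that the shares sum to one per device (recent versions of dplyr return an ungrouped result from `count()`, so the grouping has to be stated):

```r
library(dplyr)

# Hypothetical tweets: three from Android in the morning, two from iPhone in the afternoon
toy <- data.frame(
  source = c("Android", "Android", "Android", "iPhone", "iPhone"),
  hour   = c(6, 6, 7, 15, 16),
  stringsAsFactors = FALSE
)

hourly <- toy %>%
  count(source, hour) %>%
  group_by(source) %>%              # shares are computed within each device
  mutate(percent = n / sum(n)) %>%
  ungroup()

hourly
```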
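library(tidytext)
library(stringr)
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
# drop tweets that begin with a quotation mark
filter(!str_detect(text, '^"')) %>%
# strip links and HTML-escaped ampersands
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%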
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
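The splitting pattern `reg` keeps hashtags, @-mentions and apostrophes inside words together. As a rough illustration only, the sketch below applies the same pattern with `stringr::str_split()` to a made-up sentence. It approximates the splitting step rather than reproducing exactly what `unnest_tokens()` does internally:

```r
library(stringr)

reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# Hypothetical tweet text, lower-cased first as tokenisers usually do
text <- "Crooked Hillary's #MAGA rally was GREAT!"
pieces <- str_split(str_to_lower(text), reg)[[1]]

# Drop the empty strings left behind by trailing punctuation
words <- pieces[pieces != ""]
words
```

The apostrophe in "hillary's" survives because it is followed by a letter, while the space and the exclamation mark act as separators.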
It is interesting to look at the raw vocabulary being used.
tweet_words %>%
count(word, sort = TRUE) %>%
head(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_bar(stat = "identity") +
ylab("Occurrences") +
coord_flip()
Donald Trump’s Android device is used to rant about the press.
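android_iphone_ratios <- tweet_words %>%
count(word, source) %>%
# keep words used at least 5 times across both devices
group_by(word) %>%
filter(sum(n) >= 5) %>%
ungroup() %>%
spread(source, n, fill = 0) %>%
# add-one smoothing, then each word's share of its device's total
mutate(across(-word, ~ (.x + 1) / sum(.x + 1))) %>%
mutate(logratio = log2(Android / iPhone)) %>%
arrange(desc(logratio))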
android_iphone_ratios %>%
group_by(logratio > 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Android / iPhone log ratio") +
scale_fill_manual(name = "", labels = c("Android", "iPhone"),
values = c("red", "lightblue"))
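The log ratio plotted above compares each word's smoothed share of the Android vocabulary with its smoothed share of the iPhone vocabulary. A small worked sketch in base R, using made-up counts (the words and numbers are assumptions for illustration, not taken from the tweets):

```r
# Hypothetical word counts per device
counts <- data.frame(
  word    = c("fake", "jobs", "media"),
  Android = c(9, 0, 4),
  iPhone  = c(1, 7, 4),
  stringsAsFactors = FALSE
)

# Add one to every count (so zero counts do not give infinite ratios),
# then turn the counts into shares of the device's total
smooth_share <- function(x) (x + 1) / sum(x + 1)

counts$logratio <- log2(smooth_share(counts$Android) / smooth_share(counts$iPhone))
counts
```

A positive value means the word is relatively more common on Android, a negative value means it is more common on the iPhone, and a value of 1 corresponds to a word that is twice as frequent on Android.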
What sort of sentiments are used in the tweets and is there any relationship between the sentiment and the device of origin?
# NRC lexicon via get_sentiments(); recent tidytext versions no longer carry a lexicon column in sentiments
nrc <- get_sentiments("nrc") %>%
dplyr::select(word, sentiment)
sources <- tweet_words %>%
group_by(source) %>%
mutate(total_words = n()) %>%
ungroup() %>%
distinct(id, source, total_words)
by_source_sentiment <- tweet_words %>%
inner_join(nrc, by = "word") %>%
count(sentiment, id) %>%
ungroup() %>%
complete(sentiment, id, fill = list(n = 0)) %>%
inner_join(sources, by = "id") %>%
group_by(source, sentiment, total_words) %>%
summarize(words = sum(n)) %>%
arrange(-words) %>%
ungroup()
dd<-head(by_source_sentiment,30)
dd
## # A tibble: 20 x 4
## source sentiment total_words words
## <chr> <chr> <int> <dbl>
## 1 Android positive 3538 447
## 2 Android negative 3538 414
## 3 iPhone positive 2700 317
## 4 Android trust 3538 291
## 5 Android anticipation 3538 234
## 6 Android fear 3538 232
## 7 Android sadness 3538 231
## 8 Android anger 3538 228
## 9 iPhone trust 2700 225
## 10 iPhone anticipation 2700 192
## 11 Android joy 3538 174
## 12 Android disgust 3538 167
## 13 iPhone negative 2700 165
## 14 iPhone joy 2700 131
## 15 iPhone fear 2700 119
## 16 Android surprise 3538 111
## 17 iPhone anger 2700 88
## 18 iPhone surprise 2700 86
## 19 iPhone sadness 2700 78
## 20 iPhone disgust 2700 54
So, for example, 228 of the 3538 words in the Android tweets were associated with “anger”. To measure how much more likely the Android account is to use an emotionally charged term relative to the iPhone account, a Poisson test can be used.
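As a single worked case, the "anger" rows of the table above (228 of 3538 Android words, 88 of 2700 iPhone words) can be passed straight to `poisson.test()`. This is only a sketch of the comparison for one sentiment; the variable name `anger` is mine:

```r
# Counts of "anger" words and total words, Android first, then iPhone
anger <- poisson.test(c(228, 88), c(3538, 2700))

# Estimated rate ratio: (228 / 3538) / (88 / 2700)
anger$estimate

# 95% confidence interval for that ratio
anger$conf.int
```

Putting the Android counts first is what makes the ratio read as "Android relative to iPhone".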
library(broom)
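sentiment_differences <- by_source_sentiment %>%
# put the Android row first in every pair so the ratio reads Android relative to iPhone
arrange(source) %>%
group_by(sentiment) %>%
do(tidy(poisson.test(.$words, .$total_words)))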
library(scales)
sentiment_differences %>%
ungroup() %>%
mutate(sentiment = reorder(sentiment, estimate)) %>%
mutate(across(c(estimate, conf.low, conf.high), ~ .x - 1)) %>%
ggplot(aes(estimate, sentiment)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
scale_x_continuous(labels = percent_format()) +
labs(x = "% increase in Android relative to iPhone",
y = "Sentiment")
android_iphone_ratios %>%
inner_join(nrc, by = "word") %>%
filter(!sentiment %in% c("positive", "negative")) %>%
mutate(sentiment = reorder(sentiment, -logratio),
word = reorder(word, -logratio)) %>%
group_by(sentiment) %>%
top_n(10, abs(logratio)) %>%
ungroup() %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
facet_wrap(~ sentiment, scales = "free", nrow = 2) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "", y = "Android / iPhone log ratio") + coord_flip()+
scale_fill_manual(name = "", labels = c("Android", "iPhone"),
values = c("red", "lightblue"))
Notice that one of the accounts makes no mention of America or of jobs. In fact, Russia figures larger in Trump’s Android account. The iPhone account is much more positive and focussed on political issues that may matter to the electorate.
library(wordcloud)
tweet_words %>%
group_by(word,source) %>%
count() ->allwords
words<-subset(allwords,source=="Android")
wordcloud(words$word, words$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
words<-subset(allwords,source=="iPhone")
wordcloud(words$word, words$n, random.order=FALSE, max.words = 100, colors=brewer.pal(8, "Dark2"), use.r.layout=TRUE)
https://bitbucket.org/dgolicher/aqm2017/raw/a0a96bdb972804b78011944d2432e20a343434b6/TrumpTweets.Rmd