Introduction

The Likert scale is very commonly chosen for measuring the degree of agreement with statements in questionaires. We can use the example of Likert scale data to look at a range of technical aspects regarding the way data is handled in R.

Packages used

## Load the packages
library(dplyr)
library(aqm)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)

Simulating a single vector of Likert data

To simulate responses from the Likert scale we can write a function that takes the number of responses and a vector of proportion responding in each class. The following chunk builds a function that will by default return 1000 responses, which tend to be positive.

rand_likert<-function(n=1000, p=c(1,2,3,5,10))
{
  ## Form a vector of responses
  lscale<-c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree")
  ## Sample 1000 times with replacement and unequal probability
  lik<-sample(lscale,n,replace=TRUE,prob=p)
  ## Turn the response into an ordered factor
  lik<-factor(lik,lscale,ordered=TRUE) 
  lik
}

Notice that the function works in the following way ..

  1. The “lscale” object is a vector of categorical responses.
  2. The second command samples responses from this vector with a probability given by the numbers in a second vector. If this is not provided then all the values have an equal probability of being chosen. So we can simulate responses that tend to be positive by providing a vecotr with high values on the right hand side (as in the default for the function)
  3. The result of sampling is a character vector. We can form the character vector into an ordered factor by providing a list of values in order. The lscale vector is such a list, so the function returned an ordered factor.

The function has been included in the aqm package on the server.

Character vectors and factors

This example provides another opportunity to explain the difference between character vectors and factors. The function returns an ordered factor

q1<-rand_likert(100)
str(q1)
##  Ord.factor w/ 5 levels "Strongly disagree"<..: 4 3 2 4 5 3 4 5 5 4 ...

The vector actually consists of numerical values between 1 and 5 with the factor levels as labels. If we coerce the factor to a character vector then the vector will just contains text.

q1<-as.character(q1)
str(q1)
##  chr [1:100] "Agree" "Neutral" "Disagree" "Agree" "Strongly agree" ...

It is important to be aware that if data held in the form of text is read into R using read.csv then R will silently coerce the character vector to a factor. This behavious differs from that of the read_csv function in the readr package which keeps the data as a character vector unless other instructions are provided.

Notice that if we coerce a character vector for Likert scale text into a factor it will not automatically be in the order we would like. The default order is simply the alphabetical order.

q1<-as.factor(q1)
str(q1)
##  Factor w/ 5 levels "Agree","Disagree",..: 1 3 2 1 4 3 1 4 4 1 ...
levels(q1)
## [1] "Agree"             "Disagree"          "Neutral"          
## [4] "Strongly agree"    "Strongly disagree"

So, to produce an ordered factor we need to look at the levels of the factor which are produced by default and place them in a more logical order. This step requires a little care.

q1<-factor(q1,levels(q1)[c(5,2,3,1,4)],ordered=TRUE) 
str(q1)
##  Ord.factor w/ 5 levels "Strongly disagree"<..: 4 3 2 4 5 3 4 5 5 4 ...

Forming a data frame

Let’s now simulate the responses of one thousand people to four questions. Notice that the first question takes the default vector of probabilities. The second question provokes more negative responses. The third question is more neutral and the final question has a “marmite” type response.

q1<-rand_likert()
q2<-rand_likert(p=c(4,3,1,1,1))
q3<-rand_likert(p=c(1,2,5,2,1))
q4<-rand_likert(p=c(5,1,1,1,4))

d<-data.frame(id=1:1000,q1,q2,q3,q4)

Looking at the raw data..

dt(d)

This produces a wide data frame. This is likely to represent the way data were entered in a spreadsheet or other tabular data capturing software. It is not wrong in any way to represent data in this form. Some analyses need data in a wide format. One example would be to use base R to tabulate each response.

table(d$q1)
## 
## Strongly disagree          Disagree           Neutral             Agree 
##                40                93               146               236 
##    Strongly agree 
##               485

Or to use qqplot to produce a barplot for a single question

ggplot(d,aes(x=q2)) + geom_bar() +coord_flip()

However this can get bit tedious and repetetive. The four columns holding the responses to questions actually should be thought of as a matrix. A true data frame would consist of only three columns. The id of the individual providing the response, a column holding the code for the question asked and a third column containing the response.

df<-pivot_longer(d, cols=-id,names_to="question",values_to = "response")
dt(df)

Now we can tabulate all the data with one command.

df %>% group_by(question,response) %>% summarise(n=n()) %>% mutate(percent = round(100*n / sum(n),0)) -> tb
dt(tb)

Adding question text

In order to collect real data the survey will be based on questions which may frequently consist of longish sentences. They may also be grouped by theme.

The easiest way to handle questionaire data of this type is to use a relational data structure in which data corresponding to details on the question content are held in an additional table that can be linked to the responses and data on the individuals who provided the responses held in another table. The tables can then be linked together.

For example, the text of the

library(aqm)
data(nss_questions)
dt(nss_questions)
df<-dplyr::inner_join(df,nss_questions)
tb<-dplyr::inner_join(tb,nss_questions)
ggplot(data=tb,aes(x=response,y=percent,label=percent)) + facet_wrap('short_text') +
  geom_bar(stat="identity") + 
  geom_label()  + 
  coord_flip()

#theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Remove coord flip and comment if you want vertical bars with labels at the bottom.
ggplot(tb) + 
  aes(x = short_text, y = percent, fill = response, label=percent) +
   geom_bar(stat="identity", width = 0.5)+ 
  coord_flip() +
   geom_text(aes(label = percent), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(
    values = colorRampPalette(
      RColorBrewer::brewer.pal(n = 5, name = "RdBu"))(6), 
                    guide = guide_legend(reverse = TRUE)) 

set.seed(1)

num_likert<-function(x){
lscale<-c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree")
x<-as.factor(x)
levels(x)<-lscale
x
}

lik_cor<-function(n=500,cor=0.7, mu=c(1,1,1,-1)){
nq<-length(mu)
mat <- matrix(cor, nrow = nq, ncol = nq)
diag(mat) <- 1
d<-  mvrnorm(n=n, mu=mu, Sigma=mat, empirical=TRUE)
d<-round(d,0) +3
d[d>5]<-5
d[d<1]<-1
d<-as.data.frame(d)
d<-mutate_all(d,num_likert)
d
}

d1<-lik_cor()
d2<-lik_cor(cor=0.5)
d3<-lik_cor(cor=0.4)
d<-data.frame(d1,d2,d3)
names(d)<-paste("q",1:12)
dd<-d<-mutate_all(d,as.numeric)

rank_cor <- cor(dd,method="spearman")
cor.plot(rank_cor, numbers=T, upper=FALSE, main = "Rank Correlation", show.legend = FALSE)

fa.parallel(d, fm="minres", fa="fa", cor='poly', main = "Polychoric scree Plot")
## Warning in matpLower(x, nvar, gminx, gmaxx, gminy, gmaxy): 66 cells were
## adjusted for 0 values using the correction for continuity. Examine your
## data carefully.

## Parallel analysis suggests that the number of factors =  3  and the number of components =  NA
mod<-fa(d, nfactor=4, cor="poly", fm="mle", rotate = "oblimin")
## Warning in matpLower(x, nvar, gminx, gmaxx, gminy, gmaxy): 66 cells were
## adjusted for 0 values using the correction for continuity. Examine your
## data carefully.
mod$loadings
## 
## Loadings:
##      ML1    ML3    ML2    ML4   
## q 1   0.826                     
## q 2   0.805                     
## q 3   0.800                     
## q 4   0.738        -0.154  0.105
## q 5                 0.652       
## q 6                 0.556  0.152
## q 7                 0.604  0.173
## q 8                 0.120  0.789
## q 9          0.598  0.254 -0.121
## q 10         0.627              
## q 11         0.671 -0.153  0.120
## q 12         0.569 -0.140       
## 
##                  ML1   ML3   ML2   ML4
## SS loadings    2.521 1.531 1.261 0.731
## Proportion Var 0.210 0.128 0.105 0.061
## Cumulative Var 0.210 0.338 0.443 0.504
fa.diagram(mod)