Reputation: 151
I am working on a report that will display the results of some Likert scale data. I want to use the skim() function from the skimr package to utilize the spark graphs/histogram visual. The issue is that my response options range from 1 to 5 on each question, but some of my questions only collected responses in the 3 to 5 range (response options 1 and 2 were not selected). The histogram shows five columns and the range seems to represent 3, 3.5, 4, 4.5, 5 rather than from 1 to 5. How do I tell skimr to display option 1 through 5? Thanks for any help in advance.
Example:
Data:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
1 3 3 3 1 3 4 4
5 5 5 4 2 5 5 5
5 5 5 5 5 5 5 5
5 5 5 4 2 5 5 5
5 5 5 4 2 5 5 5
I use the following code:
skim(Data)
I want the histograms (the "hist" column) to show responses 1 through 5. But for variables 2, 3, 4, 6, 7 and 8 it only shows values from 3 or 4 through 5. Is there any way to adjust this?
Upvotes: 2
Views: 445
Reputation: 4949
You seem to have a bit of a misconception.
Let's take your unchanged data in the form of a tibble and put it into the skim() function.
library(tidyverse)
library(skimr)
df = read.table(
header = TRUE,text="
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
1 3 3 3 1 3 4 4
5 5 5 4 2 5 5 5
5 5 5 5 5 5 5 5
5 5 5 4 2 5 5 5
5 5 5 4 2 5 5 5
") %>% as_tibble()
df %>% skim()
This gives the following output:
-- Data Summary ------------------------
Values
Name Piped data
Number of rows 5
Number of columns 8
_______________________
Column type frequency:
numeric 8
________________________
Group variables None
-- Variable type: numeric ---------------------------------------------------------------------------------------------
# A tibble: 8 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Var1 0 1 4.2 1.79 1 5 5 5 5 ▂▁▁▁▇
2 Var2 0 1 4.6 0.894 3 5 5 5 5 ▂▁▁▁▇
3 Var3 0 1 4.6 0.894 3 5 5 5 5 ▂▁▁▁▇
4 Var4 0 1 4 0.707 3 4 4 4 5 ▂▁▇▁▂
5 Var5 0 1 2.4 1.52 1 2 2 2 5 ▂▇▁▁▂
6 Var6 0 1 4.6 0.894 3 5 5 5 5 ▂▁▁▁▇
7 Var7 0 1 4.8 0.447 4 5 5 5 5 ▂▁▁▁▇
8 Var8 0 1 4.8 0.447 4 5 5 5 5 ▂▁▁▁▇
However, you write that your data is on a Likert scale. For such ordinal data it makes no sense to compute the mean, standard deviation, etc., because what does it mean that the average for the variable Var1 is 4.2? I can't interpret it.
Then we have to mutate all variables to the factor type.
df %>% mutate_all(~factor(., 1:5)) %>% skim()
output
-- Data Summary ------------------------
Values
Name Piped data
Number of rows 5
Number of columns 8
_______________________
Column type frequency:
factor 8
________________________
Group variables None
-- Variable type: factor ----------------------------------------------------------------------------------------------
# A tibble: 8 x 6
skim_variable n_missing complete_rate ordered n_unique top_counts
* <chr> <int> <dbl> <lgl> <int> <chr>
1 Var1 0 1 FALSE 2 5: 4, 1: 1, 2: 0, 3: 0
2 Var2 0 1 FALSE 2 5: 4, 3: 1, 1: 0, 2: 0
3 Var3 0 1 FALSE 2 5: 4, 3: 1, 1: 0, 2: 0
4 Var4 0 1 FALSE 3 4: 3, 3: 1, 5: 1, 1: 0
5 Var5 0 1 FALSE 3 2: 3, 1: 1, 5: 1, 3: 0
6 Var6 0 1 FALSE 2 5: 4, 3: 1, 1: 0, 2: 0
7 Var7 0 1 FALSE 2 5: 4, 4: 1, 1: 0, 2: 0
8 Var8 0 1 FALSE 2 5: 4, 4: 1, 1: 0, 2: 0
It makes a little more sense now. For the variable Var1 we can see four answers of 5, one answer of 1, and zero for the remaining options, regardless of what answer 5 actually means.
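The reason the empty options show up at all is the explicit levels argument. A small base R sketch of that step, using the Var2 column from the question (the object names are only for illustration):

```r
# Var2 from the question: only the options 3 and 5 were chosen
var2 <- c(3, 5, 5, 5, 5)

# Without explicit levels, the factor only knows the observed values
table(factor(var2))

# With levels = 1:5, every response option is kept, including empty ones
counts <- table(factor(var2, levels = 1:5))
counts
```

The second table has five cells, three of them with a count of zero, which is what lets skim() and any later plot cover the whole scale.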
However, there are no histograms now. Well, we can easily produce them ourselves.
df %>% mutate_all(~factor(., 1:5)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  # geom_bar() counts the rows per level, which is what stat = "count" was used for
  geom_bar() +
  facet_grid(rows = vars(name))
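If you would rather keep a spark-style histogram inside the skim output, with one bar for each option from 1 to 5, one possibility is to swap in your own function for the hist column. This is only a sketch: likert_spark, likert_skim and likert_df are names I made up (they are not part of skimr), and I am assuming that passing a function named hist through sfl() to skim_with() replaces the default column, so check that against your skimr version.

```r
# Hypothetical helper (not part of skimr): draws one bar per response option,
# so options nobody chose still get a (minimal) bar.
likert_spark <- function(x, levels = 1:5) {
  counts <- as.vector(table(factor(x, levels = levels)))
  if (max(counts) == 0) return(strrep(" ", length(levels)))
  # Eight block characters, from the lowest bar to a full block
  bars <- intToUtf8(0x2581:0x2588, multiple = TRUE)
  paste(bars[1 + round(7 * counts / max(counts))], collapse = "")
}

likert_spark(c(3, 5, 5, 5, 5))

# Assumption: a skimmer named "hist" replaces the default histogram column.
if (requireNamespace("skimr", quietly = TRUE)) {
  likert_skim <- skimr::skim_with(numeric = skimr::sfl(hist = likert_spark))
  likert_df <- data.frame(Var2 = c(3, 5, 5, 5, 5), Var5 = c(1, 2, 5, 2, 2))
  print(likert_skim(likert_df))
}
```

Unlike the built-in histogram, which bins over the observed range of each column, this version always uses the same five positions, so the bars are comparable across questions.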
Finally, a small tip. When working with data, give it meaningful names, and store the answers using the labels of your scale. So I renamed your variables to questions and recoded the answer values to the following levels: "definitely yes, yes, I don't know, no, definitely not".
df = read.table(
header = TRUE,text="
Question1 Question2 Question3 Question4 Question5 Question6 Question7 Question8
def.not don't.know don't.know don't.know def.not don't.know yes yes
def.yes def.yes def.yes yes not def.yes def.yes def.yes
def.yes def.yes def.yes def.yes def.yes def.yes def.yes def.yes
def.yes def.yes def.yes yes not def.yes def.yes def.yes
def.yes def.yes def.yes yes not def.yes def.yes def.yes
") %>% as_tibble() %>% mutate_all(~factor(., c("def.not", "not", "don't.know", "yes", "def.yes")))
output
# A tibble: 5 x 8
Question1 Question2 Question3 Question4 Question5 Question6 Question7 Question8
<fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 def.not don't.know don't.know don't.know def.not don't.know yes yes
2 def.yes def.yes def.yes yes not def.yes def.yes def.yes
3 def.yes def.yes def.yes def.yes def.yes def.yes def.yes def.yes
4 def.yes def.yes def.yes yes not def.yes def.yes def.yes
5 def.yes def.yes def.yes yes not def.yes def.yes def.yes
Now your histogram will be much clearer, don't you think?
df %>% pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  # geom_bar() counts the rows per level, which is what stat = "count" was used for
  geom_bar() +
  facet_grid(rows = vars(name))
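One caveat with the plots: ggplot2 drops factor levels from a discrete axis when no question uses them at all. In your full data some option may be unused everywhere, so it can help to ask for all levels explicitly. A hedged sketch with just two of your columns (the names d and long are mine, and the plot is only built if ggplot2 is installed):

```r
# Two columns from the question where options 1 and 2 were never chosen
d <- data.frame(Var2 = c(3, 5, 5, 5, 5), Var7 = c(4, 5, 5, 5, 5))

# Long format in base R; levels = 1:5 keeps the unused options as levels
long <- data.frame(
  name  = rep(names(d), each = nrow(d)),
  value = factor(unlist(d, use.names = FALSE), levels = 1:5)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(value)) +
    ggplot2::geom_bar() +
    # drop = FALSE keeps levels with no observations on the axis
    ggplot2::scale_x_discrete(drop = FALSE) +
    ggplot2::facet_grid(rows = ggplot2::vars(name))
  print(p)
}
```

With drop = FALSE the x axis runs over all five options in every panel, even though 1 and 2 never occur in this subset.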
Upvotes: 2