JeffB

Reputation: 151

Adjusting spark graphs/histograms in skimr package using R

I am working on a report that will display the results of some Likert scale data. I want to use the skim() function from the skimr package for its inline spark histograms. The issue is that the response options range from 1 to 5 on each question, but some of my questions only collected responses in the 3 to 5 range (options 1 and 2 were never selected). The histogram still shows five columns, but they seem to represent 3, 3.5, 4, 4.5, 5 rather than 1 to 5. How do I tell skimr to display options 1 through 5? Thanks in advance for any help.

Example:

Data:

Var1 Var2   Var3    Var4    Var5    Var6    Var7  Var8
1     3      3       3      1        3       4       4
5     5      5       4      2        5       5       5
5     5      5       5      5        5       5       5
5     5      5       4      2        5       5       5
5     5      5       4      2        5       5       5

I use the following code:

skim(Data)

I want the histograms (the "hist" column) to show responses 1 through 5. But for variables 2, 3, 4, 6, 7 and 8 it only shows values from 3 or 4 through 5. Is there any way to adjust this?

Upvotes: 2

Views: 445

Answers (1)

Marek Fiołka

Reputation: 4949

I think there is a small misconception here about what skim() does with numeric columns.
Let's take your data unchanged, convert it to a tibble, and pass it to skim().

library(tidyverse)
library(skimr)

df = read.table(
  header = TRUE,text="
Var1 Var2   Var3    Var4    Var5    Var6    Var7  Var8
1     3      3       3      1        3       4       4
5     5      5       4      2        5       5       5
5     5      5       5      5        5       5       5
5     5      5       4      2        5       5       5
5     5      5       4      2        5       5       5
") %>% as_tibble() 


df %>% skim()

This gives the following output:

-- Data Summary ------------------------
                           Values    
Name                       Piped data
Number of rows             5         
Number of columns          8         
_______________________              
Column type frequency:               
  numeric                  8         
________________________             
Group variables            None      

-- Variable type: numeric ---------------------------------------------------------------------------------------------
# A tibble: 8 x 11
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
* <chr>             <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Var1                  0             1   4.2 1.79      1     5     5     5     5 ▂▁▁▁▇
2 Var2                  0             1   4.6 0.894     3     5     5     5     5 ▂▁▁▁▇
3 Var3                  0             1   4.6 0.894     3     5     5     5     5 ▂▁▁▁▇
4 Var4                  0             1   4   0.707     3     4     4     4     5 ▂▁▇▁▂
5 Var5                  0             1   2.4 1.52      1     2     2     2     5 ▂▇▁▁▂
6 Var6                  0             1   4.6 0.894     3     5     5     5     5 ▂▁▁▁▇
7 Var7                  0             1   4.8 0.447     4     5     5     5     5 ▂▁▁▁▇
8 Var8                  0             1   4.8 0.447     4     5     5     5     5 ▂▁▁▁▇
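
The hist column for Var2 looks like what equal-width binning over the observed range would give. The sketch below is a base R analogy, not skimr's internal code: cut() with breaks = 5 splits the interval between the minimum and maximum of the data into five bins, so a column that only contains 3 to 5 never gets a bin for 1 or 2. The variable names are illustrative.

```r
# Var2 from the example data: only the values 3 and 5 occur
var2 <- c(3, 5, 5, 5, 5)

# Five equal-width bins over the observed range (3 to 5),
# analogous to the five bars in the hist column
observed_bins <- table(cut(var2, breaks = 5))
print(observed_bins)

# Five bins fixed to the full 1-5 scale, one bin per response option
scale_bins <- table(cut(var2, breaks = seq(0.5, 5.5, by = 1)))
print(scale_bins)
```

In the first table the single 3 lands in the leftmost bin; in the second it lands in the middle bin, which is the picture the question asks for.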

However, you write that your data is on a Likert scale. For such ordinal data it makes little sense to compute the mean, standard deviation, etc., because what does it mean that the average for Var1 is 4.2? I can't interpret it.
So the variables should be converted to factors, with all five levels 1 to 5 declared explicitly.

df %>% mutate_all(~factor(., 1:5)) %>% skim()

output

-- Data Summary ------------------------
                           Values    
Name                       Piped data
Number of rows             5         
Number of columns          8         
_______________________              
Column type frequency:               
  factor                   8         
________________________             
Group variables            None      

-- Variable type: factor ----------------------------------------------------------------------------------------------
# A tibble: 8 x 6
  skim_variable n_missing complete_rate ordered n_unique top_counts            
* <chr>             <int>         <dbl> <lgl>      <int> <chr>                 
1 Var1                  0             1 FALSE          2 5: 4, 1: 1, 2: 0, 3: 0
2 Var2                  0             1 FALSE          2 5: 4, 3: 1, 1: 0, 2: 0
3 Var3                  0             1 FALSE          2 5: 4, 3: 1, 1: 0, 2: 0
4 Var4                  0             1 FALSE          3 4: 3, 3: 1, 5: 1, 1: 0
5 Var5                  0             1 FALSE          3 2: 3, 1: 1, 5: 1, 3: 0
6 Var6                  0             1 FALSE          2 5: 4, 3: 1, 1: 0, 2: 0
7 Var7                  0             1 FALSE          2 5: 4, 4: 1, 1: 0, 2: 0
8 Var8                  0             1 FALSE          2 5: 4, 4: 1, 1: 0, 2: 0
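
The zero counts in top_counts appear because all five levels were declared up front. A small base R sketch of that effect, using the Var2 values from the example:

```r
var2 <- c(3, 5, 5, 5, 5)

# Without explicit levels, only the observed values (3 and 5) become levels
print(table(factor(var2)))

# With levels = 1:5, unobserved options are kept with a count of 0
counts <- table(factor(var2, levels = 1:5))
print(counts)
```

Likert options are ordered, so factor(var2, levels = 1:5, ordered = TRUE) may be the better fit if later steps rely on the ordering; that is a design choice, not something the original answer does.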

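The factor summary makes more sense. For Var1, for example, there are four answers of 5, one answer of 1 and none of the other options, whatever answer 5 actually stands for. Note that top_counts lists only the most frequent levels, so not all five options appear in it.
However, the hist column is gone, because skim() does not draw histograms for factors. We can easily produce them ourselves.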

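df %>% mutate_all(~factor(., 1:5)) %>% 
  pivot_longer(everything()) %>% 
  ggplot(aes(value))+
  geom_bar()+                     # counts per level, same as geom_histogram(stat="count") without the warning
  scale_x_discrete(drop=FALSE)+   # keep all five levels on the axis even if some are never used
  facet_grid(rows=vars(name))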
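
If a compact text sparkline in the spirit of the hist column is still wanted, one option is to build it from the level counts directly. The helper below, spark_counts, is hypothetical and not part of skimr; it is a minimal sketch that draws one Unicode block character per response option, assuming the scale is 1 to 5.

```r
# Hypothetical helper (not part of skimr): one block character per level
spark_counts <- function(x, levels = 1:5) {
  counts <- as.vector(table(factor(x, levels = levels)))
  blocks <- c("\u2581", "\u2582", "\u2583", "\u2584",
              "\u2585", "\u2586", "\u2587", "\u2588")
  # No valid responses at all: return a flat line
  if (max(counts) == 0) {
    return(strrep(blocks[1], length(levels)))
  }
  # Scale each count to a block height between 1 (lowest) and 8 (full)
  height <- 1 + round((length(blocks) - 1) * counts / max(counts))
  paste(blocks[height], collapse = "")
}

spark_var2 <- spark_counts(c(3, 5, 5, 5, 5))  # Var2
spark_var5 <- spark_counts(c(1, 2, 5, 2, 2))  # Var5
print(c(Var2 = spark_var2, Var5 = spark_var5))
```

Applied to the numeric data frame above, sapply(df, spark_counts) should return one five-character string per variable, with the first character always standing for option 1.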


Finally, a small tip. Give your data meaningful names, and enter the values as the labels of your scale rather than as bare numbers. Below I renamed your variables to questions and replaced the codes 1 to 5 with the levels def.not, not, don't.know, yes, def.yes (definitely not, no, I don't know, yes, definitely yes).

df = read.table(
  header = TRUE,text="
Question1 Question2   Question3    Question4    Question5    Question6    Question7  Question8
def.not     don't.know      don't.know       don't.know      def.not        don't.know          yes          yes
def.yes     def.yes      def.yes          yes           not        def.yes       def.yes       def.yes
def.yes     def.yes      def.yes       def.yes      def.yes        def.yes       def.yes       def.yes
def.yes     def.yes      def.yes          yes           not        def.yes       def.yes       def.yes
def.yes     def.yes      def.yes          yes           not        def.yes       def.yes       def.yes
") %>% as_tibble() %>% mutate_all(~factor(., c("def.not", "not", "don't.know", "yes", "def.yes")))

output

# A tibble: 5 x 8
  Question1 Question2  Question3  Question4  Question5 Question6  Question7 Question8
  <fct>     <fct>      <fct>      <fct>      <fct>     <fct>      <fct>     <fct>    
1 def.not   don't.know don't.know don't.know def.not   don't.know yes       yes      
2 def.yes   def.yes    def.yes    yes        not       def.yes    def.yes   def.yes  
3 def.yes   def.yes    def.yes    def.yes    def.yes   def.yes    def.yes   def.yes  
4 def.yes   def.yes    def.yes    yes        not       def.yes    def.yes   def.yes  
5 def.yes   def.yes    def.yes    yes        not       def.yes    def.yes   def.yes  

Now your histogram will be much clearer, don't you think?

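df %>% pivot_longer(everything()) %>% 
  ggplot(aes(value))+
  geom_bar()+                     # counts per level, same as geom_histogram(stat="count") without the warning
  scale_x_discrete(drop=FALSE)+   # keep all five levels on the axis even if some are never used
  facet_grid(rows=vars(name))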
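
If the data is already stored as the codes 1 to 5, retyping it should not be necessary. A minimal sketch of recoding with factor(), assuming the code-to-label mapping used above (1 = def.not up to 5 = def.yes); the names var4, likert_labels and question4 are illustrative.

```r
# Numeric codes as originally collected (Var4 from the example)
var4 <- c(3, 4, 5, 4, 4)

# Labels from the scale above, ordered from code 1 to code 5
likert_labels <- c("def.not", "not", "don't.know", "yes", "def.yes")

# levels = which codes exist, labels = what each code is called
question4 <- factor(var4, levels = 1:5, labels = likert_labels)
print(question4)

# Counts per label, including the two options nobody chose
print(table(question4))
```

With the tidyverse loaded, the same recoding could be applied to every column of the original numeric data frame with mutate(across(everything(), ~factor(.x, levels = 1:5, labels = likert_labels))).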


Upvotes: 2
