student999
student999

Reputation: 3

How to clean up data in R?

Currently, my data is in the following format:

Var   year   Co-1      Co-2     Co-3  ...
A     2018     a         j        .  
A     2017     b         k        .
A     2016     c         l        .
B     2018     d         m        .
B     2017     e         n        .
B     2016     f         o        .
C     2018     g         p        .
C     2017     h         q        .
C     2016     i         r        .
.       .      .         .        .
.       .      .         .
.       .      .         .

I want to transform it to the following format:

Company   year    A       B       C
Co-1      2018    a       d       g
Co-1      2017    b       e       h
Co-1      2016    c       f       i
Co-2      2018    j       m       p 
Co-2      2017    k       n       q
Co-2      2016    l       o       r
Co-3      2018    .       .
Co-3      2017    .       .
Co-3      2016    .       .
.
.
.

Essentially the changes are:

  1. Inserting the company name multiple times in the first column, one for each year (2018,17,16)
  2. Making the variable in each column header be A, B, and C, rather than having multiple AAA,BBB,CCCs in the first column

By doing this, I want to be able to regress year vs. each of A, B, and C separately, while keeping the Company distinction for each data point, so I can group the data points by company in the finished graph.

Thank you so much!

Upvotes: 0

Views: 110

Answers (2)

jay.sf
jay.sf

Reputation: 73692

I think it would be much easier to simply use factors here. Length of labels should be the number of unique companies, i.e. "Var" values

df$Var <- factor(df$Var, labels=c("Pepsi", "Coke", "Sprite"))
names(df) <- c("company", "year", LETTERS[seq(names(df)[-(1:2)])])

Or in a single step:

df <- setNames(transform(df, Var=factor(Var, labels=c("Pepsi", "Coke", "Sprite"))),
               c("company", "year", LETTERS[seq(names(df)[-(1:2)])]))
df
#   company year A B C
# 1   Pepsi 2018 a j .
# 2   Pepsi 2017 b k .
# 3   Pepsi 2016 c l .
# 4    Coke 2018 d m .
# 5    Coke 2017 e n .
# 6    Coke 2016 f o .
# 7  Sprite 2018 g p .
# 8  Sprite 2017 h q .
# 9  Sprite 2016 i r .

Also produces cleaner classes:

sapply(df, class)
# company        year           A           B           C 
# "factor"   "integer" "character" "character" "character" 

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389275

Get the data in long format and then in wide but with different columns.

library(tidyr)

df %>%
  pivot_longer(cols = starts_with("Co")) %>%
  pivot_wider(names_from = Var, values_from = value)

# A tibble: 6 x 5
#   year name  A     B     C    
#  <int> <chr> <fct> <fct> <fct>
#1  2018 Co-1  a     d     g    
#2  2018 Co-2  j     m     p    
#3  2017 Co-1  b     e     h    
#4  2017 Co-2  k     n     q    
#5  2016 Co-1  c     f     i    
#6  2016 Co-2  l     o     r    

data

df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), year = c(2018L, 
2017L, 2016L, 2018L, 2017L, 2016L, 2018L, 2017L, 2016L), `Co-1` = 
structure(1:9, .Label = c("a", "b", "c", "d", "e", "f", "g", "h", "i"), 
class = "factor"), `Co-2` = structure(1:9, .Label = c("j", 
"k", "l", "m", "n", "o", "p", "q", "r"), class = "factor")), class = "data.frame", 
row.names = c(NA, -9L))

Upvotes: 2

Related Questions