Bruno Avila
Bruno Avila

Reputation: 296

Transform duplicate rows to columns

I'm working on a database with hundreds of variables, however, as its origin is JSON, it's making me very difficult to organize it. For example, instead of the file bringing the information in the columns, it creates new lines. See the example.

df1 <- data_frame(ID = c(111,111,111,111,111,111,222,222,333),
                  NAME = c('JOHN','JOHN','MARY','MARY','JAMES','JAMES','WILL','WILL','MARK'),
                  ADRESS = c('NY','NY','NY','NY','ROMA','ROMA','LONDON','TOKYO',''),
                  COLOR = c('GREEN','GREEN','RED','RED','YELLOW','YELLOW','BLUE','BLUE','ORANGE'),
                  CAR = c('','','BMW','BMW','TRUCK','TRUCK','FORD','FORD','FERRARI'),
                  COUNTRY = c('USA','USA','USA','USA','USA','USA','USA','USA','USA'))

I would like to organize the file in a way that it is grouped by ID, as in the example below:

df2 <- data_frame(ID = c(111,222,333),
                  NAME1 = c('JOHN','WILL','MARK'),
                  NAME2 = c('MARY','',''),
                  NAME3 = c('JAMES','',''),
                  ADRESS1 = c('NY','LONDON',''),
                  ADRESS2 = c('NY','TOKYO',''),
                  ADRESS3 = c('ROMA','',''),
                  COLOR1 = c('GREEN','BLUE','ORANGE'),
                  COLOR2 = c('RED','',''),
                  COLOR3 = c('YELLOW','',''),
                  CAR1 = c('','FORD','FERRARI'),
                  CAR2 = c('BMW','',''),
                  CAR3 = c('TRUCK','',''),
                  COUNTRY = c('USA','USA','USA'))

However, note that the COUNTRY variable does not need to have numerous columns (COUNTRY1, COUNTRY2, COUNTRY3) as the results are repeated. In my original file, I will find numerous situations like this. How would I arrange the data evenly in df2?

Upvotes: 1

Views: 329

Answers (2)

akrun
akrun

Reputation: 887048

An option is also with pivot_wider

library(dplyr)
library(tidyr)
library(data.table)
distinct(df1) %>% 
  mutate(rn = rowid(ID)) %>%
  pivot_wider(names_from = rn, values_from = NAME:CAR, 
    names_sep = "", values_fill = "") %>%
  select(-COUNTRY, COUNTRY)

-output

# A tibble: 3 × 14
     ID NAME1 NAME2  NAME3   ADRESS1  ADRESS2 ADRESS3 COLOR1 COLOR2 COLOR3   CAR1      CAR2   CAR3    COUNTRY
  <dbl> <chr> <chr>  <chr>   <chr>    <chr>   <chr>   <chr>  <chr>  <chr>    <chr>     <chr>  <chr>   <chr>  
1   111 JOHN  "MARY" "JAMES" "NY"     "NY"    "ROMA"  GREEN  "RED"  "YELLOW" ""        "BMW"  "TRUCK" USA    
2   222 WILL  "WILL" ""      "LONDON" "TOKYO" ""      BLUE   "BLUE" ""       "FORD"    "FORD" ""      USA    
3   333 MARK  ""     ""      ""       ""      ""      ORANGE ""     ""       "FERRARI" ""     ""      USA    

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101247

Perhaps we can try the following base R code using reshape

u <- reshape(
  transform(
    unique(df1),
    GRP = ave(seq_along(ID), ID, FUN = seq_along)
  ),
  direction = "wide",
  idvar = "ID",
  timevar = "GRP"
)

u[order(match(gsub("\\.\\d+", "", names(u)), names(df1)))]

which gives

> u[order(match(gsub("\\.\\d+", "", names(u)), names(df1)))]
   ID NAME.1 NAME.2 NAME.3 ADRESS.1 ADRESS.2 ADRESS.3 COLOR.1 COLOR.2 COLOR.3
1 111   JOHN   MARY  JAMES       NY       NY     ROMA   GREEN     RED  YELLOW
7 222   WILL   WILL   <NA>   LONDON    TOKYO     <NA>    BLUE    BLUE    <NA>
9 333   MARK   <NA>   <NA>              <NA>     <NA>  ORANGE    <NA>    <NA>
    CAR.1 CAR.2 CAR.3 COUNTRY.1 COUNTRY.2 COUNTRY.3
1           BMW TRUCK       USA       USA       USA
7    FORD  FORD  <NA>       USA       USA      <NA>
9 FERRARI  <NA>  <NA>       USA      <NA>      <NA>

Upvotes: 1

Related Questions