sweetmusicality
sweetmusicality

Reputation: 937

Reshaping from wide to long data while collapsing variable values for same IDs in R

I am relatively new to R. I have a dataframe df that looks like this, where PMID is an ID:

PMID          Variable         Value
1             MH               Humans
1             MH               Male
1             MH               Middle Aged
1             RN               Aldosterone
1             RN               Renin
2             MH               Accidents, Traffic
2             MH               Male
2             RN               Antivenins
3             MH               Humans
3             MH               Crotulus
3             MH               Young Adult

and so on. As you can see, some IDs have multiple MHs and/or RNs and some have none or one. I want to collapse all entries for each variable for each PMID. I also want to be able to separate each entry with a comma once collapsed, but first substitute the spaces present in the above dataframe into _ so that I can retain each value so that my final dataframe looks like this:

PMID         MH                                 RN
1            Humans, Male, Middle_Aged          Aldosterone, Renin
2            Accidents,_Traffic, Male           Antivenins
3            Humans, Crotulus, Young_Adult

I have over 5 million rows, so please help in making the code computationally efficient. Thanks for your help.

Upvotes: 0

Views: 221

Answers (3)

Uwe
Uwe

Reputation: 42544

As the OP is asking for an efficient solution due to the size of the production data set of over 5 M rows, I suggest to use data.table:

library(data.table)   # CRAN version 1.10.4 used
setDT(df)[, Value := stringr::str_replace_all(Value, " ", "_")][]
dcast(df, PMID ~ Variable, toString, value.var = "Value")
   PMID                            MH                 RN
1:    1     Humans, Male, Middle_Aged Aldosterone, Renin
2:    2      Accidents,_Traffic, Male         Antivenins
3:    3 Humans, Crotulus, Young_Adult

Data

df <- readr::read_table(
 "PMID          Variable         Value
  1             MH               Humans
  1             MH               Male
  1             MH               Middle Aged
  1             RN               Aldosterone
  1             RN               Renin
  2             MH               Accidents, Traffic
  2             MH               Male
  2             RN               Antivenins
  3             MH               Humans
  3             MH               Crotulus
  3             MH               Young Adult"
)

Upvotes: 0

BENY
BENY

Reputation: 323226

This is my solution :

library(reshape2)
df=ddply(df,.(PMID,Variable), summarise,Pri = paste(Value,collapse=","))
acast(df, PMID ~ Variable)

  MH                           RN                 
1 "Humans,Male,MiddleAged"     "Aldosterone,Renin"
2 "Accidentstraffic,Male"      "Antivenins"       
3 "Humans,Crotulus,YoungAdult" NA  

Upvotes: 0

jbaums
jbaums

Reputation: 27388

Here's one solution, using dplyr and tidyr:

library(dplyr)
library(tidyr)

d <- read.table(
text='PMID;Variable;Value
1;MH;Humans
1;MH;Male
1;MH;Middle Aged
1;RN;Aldosterone
1;RN;Renin
2;MH;Accidents, Traffic
2;MH;Male
2;RN;Antivenins
3;MH;Humans
3;MH;Crotulus
3;MH;Young Adult', 
header=TRUE, sep=';', stringsAsFactors=FALSE)

d %>% 
  group_by(PMID, Variable) %>% 
  summarise(Value=paste(gsub(' ', '_', Value), collapse=', ')) %>% 
  spread(Variable, Value)


## Source: local data frame [3 x 3]
## Groups: PMID [3]
## 
## # A tibble: 3 x 3
##    PMID                            MH                  RN
## * <int>                         <chr>               <chr>
## 1     1     Humans, Male, Middle_Aged  Aldosterone, Renin
## 2     2      Accidents,_Traffic, Male          Antivenins
## 3     3 Humans, Crotulus, Young_Adult                <NA>

Upvotes: 3

Related Questions