Reputation: 937
I am relatively new to R. I have a dataframe df
that looks like this, where PMID is an ID:
PMID Variable Value
1 MH Humans
1 MH Male
1 MH Middle Aged
1 RN Aldosterone
1 RN Renin
2 MH Accidents, Traffic
2 MH Male
2 RN Antivenins
3 MH Humans
3 MH Crotulus
3 MH Young Adult
and so on. As you can see, some IDs have multiple MHs and/or RNs and some have none or one. I want to collapse all entries for each variable for each PMID. I also want to be able to separate each entry with a comma once collapsed, but first substitute the spaces present in the above dataframe into _
so that I can retain each value so that my final dataframe looks like this:
PMID MH RN
1 Humans, Male, Middle_Aged Aldosterone, Renin
2 Accidents,_Traffic, Male Antivenins
3 Humans, Crotulus, Young_Adult
I have over 5 million rows, so please help in making the code computationally efficient. Thanks for your help.
Upvotes: 0
Views: 221
Reputation: 42544
As the OP is asking for an efficient solution due to the size of the production data set of over 5 M rows, I suggest to use data.table
:
library(data.table) # CRAN version 1.10.4 used
setDT(df)[, Value := stringr::str_replace_all(Value, " ", "_")][]
dcast(df, PMID ~ Variable, toString, value.var = "Value")
PMID MH RN 1: 1 Humans, Male, Middle_Aged Aldosterone, Renin 2: 2 Accidents,_Traffic, Male Antivenins 3: 3 Humans, Crotulus, Young_Adult
df <- readr::read_table(
"PMID Variable Value
1 MH Humans
1 MH Male
1 MH Middle Aged
1 RN Aldosterone
1 RN Renin
2 MH Accidents, Traffic
2 MH Male
2 RN Antivenins
3 MH Humans
3 MH Crotulus
3 MH Young Adult"
)
Upvotes: 0
Reputation: 323226
This is my solution :
library(reshape2)
df=ddply(df,.(PMID,Variable), summarise,Pri = paste(Value,collapse=","))
acast(df, PMID ~ Variable)
MH RN
1 "Humans,Male,MiddleAged" "Aldosterone,Renin"
2 "Accidentstraffic,Male" "Antivenins"
3 "Humans,Crotulus,YoungAdult" NA
Upvotes: 0
Reputation: 27388
Here's one solution, using dplyr
and tidyr
:
library(dplyr)
library(tidyr)
d <- read.table(
text='PMID;Variable;Value
1;MH;Humans
1;MH;Male
1;MH;Middle Aged
1;RN;Aldosterone
1;RN;Renin
2;MH;Accidents, Traffic
2;MH;Male
2;RN;Antivenins
3;MH;Humans
3;MH;Crotulus
3;MH;Young Adult',
header=TRUE, sep=';', stringsAsFactors=FALSE)
d %>%
group_by(PMID, Variable) %>%
summarise(Value=paste(gsub(' ', '_', Value), collapse=', ')) %>%
spread(Variable, Value)
## Source: local data frame [3 x 3]
## Groups: PMID [3]
##
## # A tibble: 3 x 3
## PMID MH RN
## * <int> <chr> <chr>
## 1 1 Humans, Male, Middle_Aged Aldosterone, Renin
## 2 2 Accidents,_Traffic, Male Antivenins
## 3 3 Humans, Crotulus, Young_Adult <NA>
Upvotes: 3