Reputation: 2310
I would like to tag samples (sample_id
collumn) with more than one State
in the same no
collumn with E
string.
My df
dataframe input:
no sample_id State
chr1-15984544-15996851-0n NE001788 0n
chr1-15984544-15996851-0n NE001788 1n
chr1-15984544-15996851-0n NE001836 0n
chr1-15984544-15996851-0n NE002026 0n
chr1-15984544-15996851-0n NE001413 0n
chr1-15984544-15996851-0n NE001438 0n
My expected output
:
no sample_id State
chr1-15984544-15996851-0n NE001788 E
chr1-15984544-15996851-0n NE001836 0n
chr1-15984544-15996851-0n NE002026 0n
chr1-15984544-15996851-0n NE001413 0n
chr1-15984544-15996851-0n NE001438 0n
The sample NE001788
was tagged with E because it have two different states (State
) in a same no
string.
I had used the below code to small dataframes:
df <- read.table(text= 'no sample_id State
chr1-15984544-15996851-0n NE001788 0n
chr1-15984544-15996851-0n NE001788 1n
chr1-15984544-15996851-0n NE001836 0n
chr1-15984544-15996851-0n NE002026 0n
chr1-15984544-15996851-0n NE001413 0n
chr1-15984544-15996851-0n NE001438 0n',header=TRUE)
library(plyr)
output <- unique(ddply(df,.(no,sample_id),mutate,State=if(length(unique(State))>1) {"E"} else State))
It works fine. However, I have now a large data frame (more than 700k rows). In this large dataframe I get a memory error: cannot allocate vector of size 75kb
.
I am here to ask alternatives to reach the same result, without memory breakout.
Thank you very much.
Upvotes: 0
Views: 67
Reputation: 70256
And here's the dyplr
code to do it:
dd %>%
mutate(State = as.character(State)) %>%
group_by(no, sample_id) %>%
summarize(State = ifelse(length(unique(State)) > 1, "E", State))
Most likely, using dplyr
will be faster than plyr
, but I don't know how it compares in terms of memory usage, since that seems to be the bottleneck in your case.
Note that I convert State
to character before the operation because if you read in the data from the question, it will be factor
s. If in reality they are characters, you can skip that of course.
Note: I use length(unique(State)) > 1
to cover the (hypothetical) case where entries in no
, sample_id
and State
are all the same in multiple rows. Based on your description you wouldn't want to assign E
to State
in that case, but it's not clear if such a case is possible at all in your data. If not, you could replace length(unique(State)) > 1
with n() > 1
.
Upvotes: 1
Reputation: 92282
Try data.table
. I didn't benchmark this code, but it should be certainly better than plyr
library(data.table)
df <- setDT(df)[, lapply(.SD, function(x) ifelse(.N > 1, "E", as.character(x))), by = c("no", "sample_id"), .SDcols = "State"]
## no sample_id State
## 1: chr1-15984544-15996851-0n NE001788 E
## 2: chr1-15984544-15996851-0n NE001836 0n
## 3: chr1-15984544-15996851-0n NE002026 0n
## 4: chr1-15984544-15996851-0n NE001413 0n
## 5: chr1-15984544-15996851-0n NE001438 0n
Better option will be to first make State
a character (if it's not already) in order to avoid doing as.character
in each group, and then do the subsetting. Something like
setDT(df)[, State := as.character(State)]
df <- df[, lapply(.SD, function(x) ifelse(.N > 1, "E", x)), by = c("no", "sample_id"), .SDcols = "State"]
Upvotes: 3