user3091668

Reputation: 2310

Managing duplicated rows in large data frames

I would like to tag samples (sample_id column) that have more than one State within the same no value with the string E.

My input data frame df:

          no               sample_id  State 
chr1-15984544-15996851-0n  NE001788    0n
chr1-15984544-15996851-0n  NE001788    1n
chr1-15984544-15996851-0n  NE001836    0n
chr1-15984544-15996851-0n  NE002026    0n
chr1-15984544-15996851-0n  NE001413    0n
chr1-15984544-15996851-0n  NE001438    0n

My expected output:

          no               sample_id  State 
chr1-15984544-15996851-0n  NE001788    E
chr1-15984544-15996851-0n  NE001836    0n
chr1-15984544-15996851-0n  NE002026    0n
chr1-15984544-15996851-0n  NE001413    0n
chr1-15984544-15996851-0n  NE001438    0n

The sample NE001788 was tagged with E because it has two different states (State) for the same no string. I have used the code below for small data frames:

df <- read.table(text= 'no  sample_id  State 
                 chr1-15984544-15996851-0n  NE001788    0n
                 chr1-15984544-15996851-0n  NE001788    1n
                 chr1-15984544-15996851-0n  NE001836    0n
                 chr1-15984544-15996851-0n  NE002026    0n
                 chr1-15984544-15996851-0n  NE001413    0n
                 chr1-15984544-15996851-0n  NE001438    0n',header=TRUE) 

library(plyr)
output <- unique(ddply(df,.(no,sample_id),mutate,State=if(length(unique(State))>1) {"E"} else State))

It works fine. However, I now have a large data frame (more than 700k rows). With this large data frame I get a memory error: cannot allocate vector of size 75kb.

I am asking for alternatives that reach the same result without running out of memory.

Thank you very much.

Upvotes: 0

Views: 67

Answers (2)

talat

Reputation: 70256

And here's the dplyr code to do it:

library(dplyr)

df %>%
  mutate(State = as.character(State)) %>%
  group_by(no, sample_id) %>%
  summarize(State = ifelse(length(unique(State)) > 1, "E", State))

Most likely, dplyr will be faster than plyr, but I don't know how it compares in terms of memory usage, which seems to be the bottleneck in your case.

Note that I convert State to character before the operation because if you read the data in as in the question, it will be a factor. If in reality it is already character, you can of course skip that step.

Note: I use length(unique(State)) > 1 to cover the (hypothetical) case where entries in no, sample_id and State are all identical across multiple rows. Based on your description you wouldn't want to assign E to State in that case, but it's not clear whether such a case can occur in your data at all. If not, you could replace length(unique(State)) > 1 with n() > 1.
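
For illustration, the n() variant would look like this (a sketch using the read.table data from the question; the .groups = "drop" argument assumes dplyr >= 1.0 and only silences the grouping message):

```r
library(dplyr)

# data from the question
df <- read.table(text = 'no  sample_id  State
                 chr1-15984544-15996851-0n  NE001788    0n
                 chr1-15984544-15996851-0n  NE001788    1n
                 chr1-15984544-15996851-0n  NE001836    0n
                 chr1-15984544-15996851-0n  NE002026    0n
                 chr1-15984544-15996851-0n  NE001413    0n
                 chr1-15984544-15996851-0n  NE001438    0n', header = TRUE)

res <- df %>%
  mutate(State = as.character(State)) %>%
  group_by(no, sample_id) %>%
  # n() counts rows per (no, sample_id) group; any group with more
  # than one row is collapsed to a single "E" row
  summarize(State = ifelse(n() > 1, "E", State), .groups = "drop")
```

Again, this is only safe if exact duplicate rows cannot occur in your data.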

Upvotes: 1

David Arenburg

Reputation: 92282

Try data.table. I didn't benchmark this code, but it should certainly be better than plyr:

library(data.table)
df <- setDT(df)[, lapply(.SD, function(x) ifelse(.N > 1, "E", as.character(x))),
                by = c("no", "sample_id"), .SDcols = "State"]

##                           no sample_id State
## 1: chr1-15984544-15996851-0n  NE001788     E
## 2: chr1-15984544-15996851-0n  NE001836    0n
## 3: chr1-15984544-15996851-0n  NE002026    0n
## 4: chr1-15984544-15996851-0n  NE001413    0n
## 5: chr1-15984544-15996851-0n  NE001438    0n

A better option would be to first convert State to character (if it isn't already) in order to avoid calling as.character in each group, and then do the aggregation. Something like

setDT(df)[, State := as.character(State)]
df <- df[, lapply(.SD, function(x) ifelse(.N > 1, "E", x)), by = c("no", "sample_id"), .SDcols = "State"]
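
The same idea can also be written with data.table's uniqueN (available since data.table 1.9.6) instead of lapply over .SD; a sketch using the read.table data from the question:

```r
library(data.table)

# data from the question
df <- read.table(text = 'no  sample_id  State
                 chr1-15984544-15996851-0n  NE001788    0n
                 chr1-15984544-15996851-0n  NE001788    1n
                 chr1-15984544-15996851-0n  NE001836    0n
                 chr1-15984544-15996851-0n  NE002026    0n
                 chr1-15984544-15996851-0n  NE001413    0n
                 chr1-15984544-15996851-0n  NE001438    0n', header = TRUE)

setDT(df)[, State := as.character(State)]
# recycle "E" across every row of a group that has more than one
# distinct State, then drop the now-duplicated rows
df[, State := if (uniqueN(State) > 1) "E" else State, by = .(no, sample_id)]
res <- unique(df)
```

This updates State by reference, which can help with memory on a 700k-row table, though I haven't benchmarked it against the lapply version.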

Upvotes: 3
