user3091668

Reputation: 2310

Managing duplicated rows in large data frames

I would like to tag samples (sample_id column) that have more than one State within the same no value with the string E.

My input data frame df:

          no               sample_id  State 
chr1-15984544-15996851-0n  NE001788    0n
chr1-15984544-15996851-0n  NE001788    1n
chr1-15984544-15996851-0n  NE001836    0n
chr1-15984544-15996851-0n  NE002026    0n
chr1-15984544-15996851-0n  NE001413    0n
chr1-15984544-15996851-0n  NE001438    0n

My expected output:

          no               sample_id  State 
chr1-15984544-15996851-0n  NE001788    E
chr1-15984544-15996851-0n  NE001836    0n
chr1-15984544-15996851-0n  NE002026    0n
chr1-15984544-15996851-0n  NE001413    0n
chr1-15984544-15996851-0n  NE001438    0n

The sample NE001788 was tagged with E because it has two different states (State) for the same no string. I have used the code below for small data frames:

df <- read.table(text= 'no  sample_id  State 
                 chr1-15984544-15996851-0n  NE001788    0n
                 chr1-15984544-15996851-0n  NE001788    1n
                 chr1-15984544-15996851-0n  NE001836    0n
                 chr1-15984544-15996851-0n  NE002026    0n
                 chr1-15984544-15996851-0n  NE001413    0n
                 chr1-15984544-15996851-0n  NE001438    0n',header=TRUE) 

library(plyr)
output <- unique(ddply(df,.(no,sample_id),mutate,State=if(length(unique(State))>1) {"E"} else State))

It works fine. However, I now have a large data frame (more than 700k rows). With this large data frame I get a memory error: cannot allocate vector of size 75kb.

I am asking for alternatives that reach the same result without running out of memory.

Thank you very much.

Upvotes: 0

Views: 67

Answers (2)

talat

Reputation: 70256

And here's the dplyr code to do it:

library(dplyr)

df %>%
  mutate(State = as.character(State)) %>%
  group_by(no, sample_id) %>%
  summarize(State = ifelse(length(unique(State)) > 1, "E", State))

Most likely, dplyr will be faster than plyr, but I don't know how it compares in terms of memory usage, which seems to be the bottleneck in your case.

Note that I convert State to character before the operation because if you read the data in as in the question, it will be a factor. If in reality it is already character, you can of course skip that step.

Note: I use length(unique(State)) > 1 to cover the (hypothetical) case where entries in no, sample_id and State are all identical across multiple rows. Based on your description you wouldn't want to assign E to State in that case, but it's not clear whether such a case can occur in your data at all. If not, you could replace length(unique(State)) > 1 with n() > 1.
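
For illustration, the n() variant would look like this (a sketch using the read.table data from the question; the .groups = "drop" argument assumes dplyr >= 1.0 and only silences the grouping message):

```r
library(dplyr)

# data from the question
df <- read.table(text = 'no  sample_id  State
                 chr1-15984544-15996851-0n  NE001788    0n
                 chr1-15984544-15996851-0n  NE001788    1n
                 chr1-15984544-15996851-0n  NE001836    0n
                 chr1-15984544-15996851-0n  NE002026    0n
                 chr1-15984544-15996851-0n  NE001413    0n
                 chr1-15984544-15996851-0n  NE001438    0n', header = TRUE)

res <- df %>%
  mutate(State = as.character(State)) %>%
  group_by(no, sample_id) %>%
  # n() counts rows per (no, sample_id) group; any group with more
  # than one row is collapsed to a single "E" row
  summarize(State = ifelse(n() > 1, "E", State), .groups = "drop")
```

Again, this is only safe if exact duplicate rows cannot occur in your data.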

Upvotes: 1

David Arenburg

Reputation: 92282

Try data.table. I didn't benchmark this code, but it should certainly be better than plyr:

library(data.table)
df <- setDT(df)[, lapply(.SD, function(x) ifelse(.N > 1, "E", as.character(x))),
                by = c("no", "sample_id"), .SDcols = "State"]

##                           no sample_id State
## 1: chr1-15984544-15996851-0n  NE001788     E
## 2: chr1-15984544-15996851-0n  NE001836    0n
## 3: chr1-15984544-15996851-0n  NE002026    0n
## 4: chr1-15984544-15996851-0n  NE001413    0n
## 5: chr1-15984544-15996851-0n  NE001438    0n

A better option would be to first convert State to character (if it isn't already) in order to avoid calling as.character in each group, and then do the aggregation. Something like

setDT(df)[, State := as.character(State)]
df <- df[, lapply(.SD, function(x) ifelse(.N > 1, "E", x)), by = c("no", "sample_id"), .SDcols = "State"]
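
The same idea can also be written with data.table's uniqueN (available since data.table 1.9.6) instead of lapply over .SD; a sketch using the read.table data from the question:

```r
library(data.table)

# data from the question
df <- read.table(text = 'no  sample_id  State
                 chr1-15984544-15996851-0n  NE001788    0n
                 chr1-15984544-15996851-0n  NE001788    1n
                 chr1-15984544-15996851-0n  NE001836    0n
                 chr1-15984544-15996851-0n  NE002026    0n
                 chr1-15984544-15996851-0n  NE001413    0n
                 chr1-15984544-15996851-0n  NE001438    0n', header = TRUE)

setDT(df)[, State := as.character(State)]
# recycle "E" across every row of a group that has more than one
# distinct State, then drop the now-duplicated rows
df[, State := if (uniqueN(State) > 1) "E" else State, by = .(no, sample_id)]
res <- unique(df)
```

This updates State by reference, which can help with memory on a 700k-row table, though I haven't benchmarked it against the lapply version.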

Upvotes: 3
