How to efficiently edit massive amounts of data

Question

Here is an example dataset. Let me show everyone what I am doing to my data and then I will explain what I am struggling with. I apologize if the title isn't an accurate description. I tried my best but I am a bit new at this. Feel free to change it to something more suiting if needed

Location sample1 sample 2 sample 3
chr1:1234 0/1 1/1 0/0
chr2:5678 0/0 0/0 0/0
chr3:2345 1/1 1/1 1/1
chr4:6789 0/1 1/1 ./.

I use this to convert them to either a YES, NO, or MAYBE

replacement<-function(x){
  x=replace(x,which(x=='./.'),0.1) 
  x=replace(x,which(x=='0/0'),0)
  x=replace(x,which(x=='0/1'),1)
  x=replace(x,which(x=='1/1'),2)
}

test=apply(test.data.set,2,replacement)

test.data.2 <- as.data.frame(test)

replacement<-function(x){
  x=replace(x,which(x=='0.1'), "MAYBE") 
  x=replace(x,which(x=='0'), "NO")
  x=replace(x,which(x=='1'), "YES")
  x=replace(x,which(x=='2'), "YES")
}

test.data.3=apply(test.data.2,2,replacement)

test.data.4 <- as.data.frame(test.data.3)

Dataset after running

Location sample1 sample 2 sample 3
chr1:1234 YES YES NO
chr2:5678 NO NO NO
chr3:2345 YES YES YES
chr4:6789 YES YES MAYBE

So what I wrote above currently works for me. However, I have a new dataset that contains about 300 samples (columns) and about.. I'm not even sure, easily 500 million rows, so I need to alter over a billion "cells". I tried running this on a cluster with 256G with of memory and it just times out. I know what I wrote above is far from the "smoothest" way of altering my data. Does anyone have suggestions to streamline this process? I feel like dplyr has to have some kind of way to do this.

Any help would be amazing! Feel free to ask any questions if you need clarifications.

prosoitos · Accepted Answer

library(tidyverse)

Recreate your data:

df <- tibble(
  Location = letters[1:4],
  sample1 = c("0/1", "0/0", "1/1", "0/1"),
  sample2 = c("1/1", "0/0", "1/1", "1/1"),
  sample3 = c("0/0", "0/0", "1/1", "./.")
)

Code:

df %>% mutate_at(
  vars(- Location),
  funs(case_when(
    . == "1/1" | . == "0/1" ~ "YES",
    . == "0/0" ~ "NO",
    . == "./." ~ "MAYBE"
  ))
)

Result:

# A tibble: 4 x 4
  Location sample1 sample2 sample3
              
1 a        YES     YES     NO     
2 b        NO      NO      NO     
3 c        YES     YES     YES    
4 d        YES     YES     MAYBE

How to efficiently edit massive amounts of data

Answers (1)

Related Questions