Huy Nguyen

Reputation: 61

R - Speeding up combination between for loop and paste/paste0

I am working with a data frame 'df' that has millions of rows and four columns (Chromosome, Position, Allele1, Allele2). I want to concatenate the values in these columns into a single character vector 'cc'. This is my first attempt:

myfunc = function(CHR) {
    # keep only the rows for the chromosome of interest
    chr = subset(df, df$Chromosome == CHR)
    cc = data.frame(No=seq.int(nrow(chr)), pos_al1_al2=NA)
    for (i in seq_len(nrow(chr))) {
        cc$pos_al1_al2[i] = paste(CHR, chr$Position[i], ".", chr$Allele1[i], chr$Allele2[i])
    }
    cc = cc[, -1] # remove the column 'No' (once, after the loop)
    cc # return the resulting vector
}

# Run my code 
myfunc(7)

where CHR is the chromosome number of interest that I pass to the function (e.g., 1, 2, 3, ..., 22). CHR must be between 1 and 22, matching the values in the Chromosome column of 'df'.

My idea is this: I first create a data frame called cc with the same number of rows as the subset of 'df' for the chosen chromosome.

I then fill a column in cc called pos_al1_al2, where each row holds the concatenated string built in the function.

The computation is very slow. I guess the for loop is the cause, but I have no idea how to optimize my function.

Any help is appreciated! Thanks in advance.

Upvotes: 0

Views: 101

Answers (1)

Tim Biegeleisen

Reputation: 520968

Is there any reason why you can't use paste() in vectorized mode? paste() works element-wise on whole vectors, so the loop is unnecessary:

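myfunc <- function(CHR) {
    # keep only the rows for the chromosome of interest
    chr <- subset(df, df$Chromosome == CHR)
    cc <- data.frame(No = seq.int(nrow(chr)), pos_al1_al2 = NA)
    # paste() is vectorized: one call builds the string for every row
    cc$pos_al1_al2 <- paste(CHR, chr$Position, ".", chr$Allele1, chr$Allele2)
    cc <- cc[, -1] # remove the column 'No'
    cc # return the result visibly
}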
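
To check that the vectorized call gives the same strings as the loop, here is a minimal self-contained sketch. The toy 'df' values and the helper names make_labels and make_labels_loop are made up for illustration and are not part of the original question; the sketch also skips the intermediate data frame and returns the character vector directly.

```r
# Toy stand-in for the asker's 'df' (values are invented for illustration).
df <- data.frame(
  Chromosome = c(7, 7, 8),
  Position   = c(101, 202, 303),
  Allele1    = c("A", "C", "G"),
  Allele2    = c("T", "G", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the labels for one chromosome without a loop.
make_labels <- function(data, CHR) {
  chr <- data[data$Chromosome == CHR, ]
  # paste() recycles the scalar CHR and "." across the column vectors.
  paste(CHR, chr$Position, ".", chr$Allele1, chr$Allele2)
}

# Loop-based version, kept only to confirm both approaches agree.
make_labels_loop <- function(data, CHR) {
  chr <- data[data$Chromosome == CHR, ]
  out <- character(nrow(chr))
  for (i in seq_len(nrow(chr))) {
    out[i] <- paste(CHR, chr$Position[i], ".", chr$Allele1[i], chr$Allele2[i])
  }
  out
}

labels_vec  <- make_labels(df, 7)
labels_loop <- make_labels_loop(df, 7)
print(labels_vec)
print(identical(labels_vec, labels_loop))
```

Note that paste() separates its arguments with a space by default, so the strings look like "7 101 . A T". If you want no separators, paste0() or paste(..., sep = "") should do that.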

Upvotes: 2
