Reputation: 5138

Concatenate pairs of columns for entire dataframe

I am working with genetic data and I need to concatenate pairs of columns. The data I have has the major and minor alleles in separate columns (e.g., allele1a, allele1b, allele2a, allele2b, etc. etc.). I need a way to pairs of columns for the entire data frame. I included a sample below, but my data has 1.7 million pairs (so I have 3.4 million columns right now), so it will not work if I need to name each column. I will change the column names later. Any guidance is appreciated if there is a way to do this in R. I have tried to create a sequence and paste them, something like:

df <- data.frame(id = seq(1,20),
                 var1 = rep("A", 20),
                 var2 = c(rep("T", 10), rep("A", 10)),
                 var3 = rep("C", 20),
                 var4 = c(rep("C", 10), rep("G", 10)),
                 var5 = rep("A", 20),
                 var6 = c(rep("A", 10), rep("G", 10)),
                 stringsAsFactors = FALSE)

i <- seq.int(1, length(ped), by = 2L)
df <- paste0(df[i], df[i+1])

but that did not work. I want it to go from:

    id var1 var2 var3 var4 var5 var6
1   1    A    T    C    C    A    A
2   2    A    T    C    C    A    A
3   3    A    T    C    C    A    A
4   4    A    T    C    C    A    A
5   5    A    T    C    C    A    A
6   6    A    T    C    C    A    A
7   7    A    T    C    C    A    A
8   8    A    T    C    C    A    A
9   9    A    T    C    C    A    A
10 10    A    T    C    C    A    A
11 11    A    A    C    G    A    G
12 12    A    A    C    G    A    G
13 13    A    A    C    G    A    G
14 14    A    A    C    G    A    G
15 15    A    A    C    G    A    G
16 16    A    A    C    G    A    G
17 17    A    A    C    G    A    G
18 18    A    A    C    G    A    G
19 19    A    A    C    G    A    G
20 20    A    A    C    G    A    G

to:

   id var1 var2 var3
1   1   AT   CC   AA
2   2   AT   CC   AA
3   3   AT   CC   AA
4   4   AT   CC   AA
5   5   AT   CC   AA
6   6   AT   CC   AA
7   7   AT   CC   AA
8   8   AT   CC   AA
9   9   AT   CC   AA
10 10   AT   CC   AA
11 11   AA   CG   AG
12 12   AA   CG   AG
13 13   AA   CG   AG
14 14   AA   CG   AG
15 15   AA   CG   AG
16 16   AA   CG   AG
17 17   AA   CG   AG
18 18   AA   CG   AG
19 19   AA   CG   AG
20 20   AA   CG   AG

edit: Thank you!!! I was able to adapt two of the answers for my data and @akrun's ran a little faster. I created a subset of my data with 100 rows and 100,000 columns and the results are below:

microbenchmark(
+   {
+   new <- ped %>%
+   gather(key = V, value = value, -id) %>%
+   mutate(V = str_extract(V, "\\d+") %>% as.numeric()) %>%
+   group_by(id) %>%
+   mutate(pair = ceiling(V / 2)) %>% 
+   group_by(id, pair) %>%
+   summarise(combined = paste(value, collapse = "")) %>%
+   mutate(V_combo = paste0("V", pair)) %>%
+   select(-pair) %>%
+   spread(key = V_combo, value = combined) %>%
+   select(id, paste0("V", seq(1, ncol(.)-1, 1)))
+   },
+   {
+   out <- ped[1]
+   new_cols <- paste0("V", seq(1, (ncol(ped)-1)/2))
+   
+   out[new_cols] <- lapply(seq(2, ncol(ped)-1, 2), 
+                           function(i) do.call(paste0, ped[i:(i+1)]))
+   },
+   times = 1
+   )

Unit: seconds                                                                                                                                                                                                                                                                                                                                                                                                                                               

   expr           min        lq      mean    median        uq       max neval
camille     250.30901 250.30901 250.30901 250.30901 250.30901 250.30901     1
akrun       23.52434  23.52434  23.52434  23.52434  23.52434  23.52434     1
    > 
    > new <- data.frame(new, stringsAsFactors = FALSE)
    > identical(new, out)
    [1] TRUE

Upvotes: 3

Answers (5)

Onyambu

Reputation: 79358

using base r you could do:

 a <- seq(2,ncol(df),2)
 b <- paste0(unlist(df[a]),unlist(df[a+1]))
 d <- data.frame(matrix(b,nrow(df)))
 result <- cbind(df[1],d)

This can also be written in a one line:

(dat =  data.frame(matrix(paste0(unlist(df[a<-seq(2,ncol(df),2)]),unlist(df[a+1])),nrow(df))))
   X1 X2 X3
1  AT CC AA
2  AT CC AA
3  AT CC AA
4  AT CC AA
5  AT CC AA
6  AT CC AA
7  AT CC AA
8  AT CC AA
9  AT CC AA
10 AT CC AA
11 AA CG AG
12 AA CG AG
13 AA CG AG
14 AA CG AG
15 AA CG AG
16 AA CG AG
17 AA CG AG
18 AA CG AG
19 AA CG AG
20 AA CG AG

Then cbind it with the id column:

cbind(df[1],dat)

Upvotes: 2

Artem Sokolov

Reputation: 13731

Using tidyverse, you can compose the modifying expressions ahead of time, then pass them all to transmute in bulk. This solution uses column names and is therefore robust to the column ordering: if you shuffle your allele columns, this should still give you the same answer.

library( tidyverse )

# Create expressions of the form allele1 = str_c(allele1a, allele1b)
v <- str_c("allele",1:3) %>% set_names %>%
    map( ~glue::glue("str_c({.}a, {.}b)") ) %>% map( rlang::parse_expr )

df %>% transmute( id = id, !!!v )
# # A tibble: 20 x 4
#       id allele1 allele2 allele3
#    <int> <chr>   <chr>   <chr>  
#  1     1 AT      CC      AA     
#  2     2 AT      CC      AA     
#  3     3 AT      CC      AA     
#  4     4 AT      CC      AA     
# ...

I modified your data to closer match your description:

df <- data_frame(id = seq(1,20),
             allele1a = rep("A", 20),
             allele1b = c(rep("T", 10), rep("A", 10)),
             allele2a = rep("C", 20),
             allele2b = c(rep("C", 10), rep("G", 10)),
             allele3a = rep("A", 20),
             allele3b = c(rep("A", 10), rep("G", 10)))

Upvotes: 2

camille

Reputation: 16881

Here's a tidyverse way designed to scale fairly well. Instead of hard-coding that you want to pair columns 1 & 2, 3 & 4, and 5 & 6, I'm reshaping to long data to get a variable number, grouping those into pairs by dividing the variable number by 2, collapsing the letters in each pair, and reshaping back to wide. This way, you can do the same procedure on any even number of columns.

library(tidyverse)
...

Filtering for ID 1 to show a glimpse of this:

df %>%
  gather(key = var, value = value, -id) %>%
  mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
  group_by(id) %>%
  mutate(pair = ceiling(var / 2)) %>%
  filter(id == 1)
#> # A tibble: 6 x 4
#> # Groups:   id [1]
#>      id   var value  pair
#>   <int> <dbl> <chr> <dbl>
#> 1     1     1 A         1
#> 2     1     2 T         1
#> 3     1     3 C         2
#> 4     1     4 C         2
#> 5     1     5 A         3
#> 6     1     6 A         3

Then collapsing strings as a summarizing value for each combination of ID and pair:

df %>%
  gather(key = var, value = value, -id) %>%
  mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
  group_by(id) %>%
  mutate(pair = ceiling(var / 2)) %>% 
  group_by(id, pair) %>%
  summarise(combined = paste(value, collapse = ""))
#> # A tibble: 60 x 3
#> # Groups:   id [?]
#>       id  pair combined
#>    <int> <dbl> <chr>   
#>  1     1     1 AT      
#>  2     1     2 CC      
#>  3     1     3 AA      
#>  4     2     1 AT      
#>  5     2     2 CC      
#>  6     2     3 AA      
#>  7     3     1 AT      
#>  8     3     2 CC      
#>  9     3     3 AA      
#> 10     4     1 AT      
#> # ... with 50 more rows

And using spread to get back into a wide format.

df %>%
  gather(key = var, value = value, -id) %>%
  mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
  group_by(id) %>%
  mutate(pair = ceiling(var / 2)) %>% 
  group_by(id, pair) %>%
  summarise(combined = paste(value, collapse = "")) %>%
  mutate(var_combo = paste0("var", pair)) %>%
  select(-pair) %>%
  spread(key = var_combo, value = combined) %>%
  head()
#> # A tibble: 6 x 4
#> # Groups:   id [6]
#>      id var1  var2  var3 
#>   <int> <chr> <chr> <chr>
#> 1     1 AT    CC    AA   
#> 2     2 AT    CC    AA   
#> 3     3 AT    CC    AA   
#> 4     4 AT    CC    AA   
#> 5     5 AT    CC    AA   
#> 6     6 AT    CC    AA

^{Created on 2018-11-07 by the reprex package (v0.2.1)}

Upvotes: 2

akrun

Reputation: 887971

We can create a loop to subset the columns along with the adjacent column, paste it together withdo.call` and assign it as new columns to the new dataset

out <- df[1]
out[paste0("var", 1:3)] <- lapply(seq(2, ncol(df), 2), 
               function(i) do.call(paste0, df[i:(i+1)]))

Upvotes: 3

Alexandra Thayer

Reputation: 43

df <- data.frame(id = seq(1,20),
                 var1 = rep("A", 20),
                 var2 = c(rep("T", 10), rep("A", 10)),
                 var3 = rep("C", 20),
                 var4 = c(rep("C", 10), rep("G", 10)),
                 var5 = rep("A", 20),
                 var6 = c(rep("A", 10), rep("G", 10)),
                 stringsAsFactors = FALSE)

df2 <- data.frame(id = df[,1], var1 = paste(df[,2], df[,3], sep = ""), 
                  var2 = paste(df[,4], df[,5], sep = ""), 
                  var3 = paste(df[,6], df[,7], sep = ""))

Upvotes: 0

Concatenate pairs of columns for entire dataframe

Answers (5)

Related Questions