Reputation: 5138
I am working with genetic data and I need to concatenate pairs of columns. The data I have has the major and minor alleles in separate columns (e.g., allele1a, allele1b, allele2a, allele2b, etc. etc.). I need a way to pairs of columns for the entire data frame. I included a sample below, but my data has 1.7 million pairs (so I have 3.4 million columns right now), so it will not work if I need to name each column. I will change the column names later. Any guidance is appreciated if there is a way to do this in R. I have tried to create a sequence and paste them, something like:
df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)
i <- seq.int(1, length(ped), by = 2L)
df <- paste0(df[i], df[i+1])
but that did not work. I want it to go from:
id var1 var2 var3 var4 var5 var6
1 1 A T C C A A
2 2 A T C C A A
3 3 A T C C A A
4 4 A T C C A A
5 5 A T C C A A
6 6 A T C C A A
7 7 A T C C A A
8 8 A T C C A A
9 9 A T C C A A
10 10 A T C C A A
11 11 A A C G A G
12 12 A A C G A G
13 13 A A C G A G
14 14 A A C G A G
15 15 A A C G A G
16 16 A A C G A G
17 17 A A C G A G
18 18 A A C G A G
19 19 A A C G A G
20 20 A A C G A G
to:
id var1 var2 var3
1 1 AT CC AA
2 2 AT CC AA
3 3 AT CC AA
4 4 AT CC AA
5 5 AT CC AA
6 6 AT CC AA
7 7 AT CC AA
8 8 AT CC AA
9 9 AT CC AA
10 10 AT CC AA
11 11 AA CG AG
12 12 AA CG AG
13 13 AA CG AG
14 14 AA CG AG
15 15 AA CG AG
16 16 AA CG AG
17 17 AA CG AG
18 18 AA CG AG
19 19 AA CG AG
20 20 AA CG AG
edit: Thank you!!! I was able to adapt two of the answers for my data and @akrun's ran a little faster. I created a subset of my data with 100 rows and 100,000 columns and the results are below:
microbenchmark(
+ {
+ new <- ped %>%
+ gather(key = V, value = value, -id) %>%
+ mutate(V = str_extract(V, "\\d+") %>% as.numeric()) %>%
+ group_by(id) %>%
+ mutate(pair = ceiling(V / 2)) %>%
+ group_by(id, pair) %>%
+ summarise(combined = paste(value, collapse = "")) %>%
+ mutate(V_combo = paste0("V", pair)) %>%
+ select(-pair) %>%
+ spread(key = V_combo, value = combined) %>%
+ select(id, paste0("V", seq(1, ncol(.)-1, 1)))
+ },
+ {
+ out <- ped[1]
+ new_cols <- paste0("V", seq(1, (ncol(ped)-1)/2))
+
+ out[new_cols] <- lapply(seq(2, ncol(ped)-1, 2),
+ function(i) do.call(paste0, ped[i:(i+1)]))
+ },
+ times = 1
+ )
Unit: seconds
expr min lq mean median uq max neval
camille 250.30901 250.30901 250.30901 250.30901 250.30901 250.30901 1
akrun 23.52434 23.52434 23.52434 23.52434 23.52434 23.52434 1
>
> new <- data.frame(new, stringsAsFactors = FALSE)
> identical(new, out)
[1] TRUE
Upvotes: 3
Views: 1280
Reputation: 79358
using base r you could do:
a <- seq(2,ncol(df),2)
b <- paste0(unlist(df[a]),unlist(df[a+1]))
d <- data.frame(matrix(b,nrow(df)))
result <- cbind(df[1],d)
This can also be written in a one line:
(dat = data.frame(matrix(paste0(unlist(df[a<-seq(2,ncol(df),2)]),unlist(df[a+1])),nrow(df))))
X1 X2 X3
1 AT CC AA
2 AT CC AA
3 AT CC AA
4 AT CC AA
5 AT CC AA
6 AT CC AA
7 AT CC AA
8 AT CC AA
9 AT CC AA
10 AT CC AA
11 AA CG AG
12 AA CG AG
13 AA CG AG
14 AA CG AG
15 AA CG AG
16 AA CG AG
17 AA CG AG
18 AA CG AG
19 AA CG AG
20 AA CG AG
Then cbind it with the id column:
cbind(df[1],dat)
Upvotes: 2
Reputation: 13731
Using tidyverse
, you can compose the modifying expressions ahead of time, then pass them all to transmute
in bulk. This solution uses column names and is therefore robust to the column ordering: if you shuffle your allele
columns, this should still give you the same answer.
library( tidyverse )
# Create expressions of the form allele1 = str_c(allele1a, allele1b)
v <- str_c("allele",1:3) %>% set_names %>%
map( ~glue::glue("str_c({.}a, {.}b)") ) %>% map( rlang::parse_expr )
df %>% transmute( id = id, !!!v )
# # A tibble: 20 x 4
# id allele1 allele2 allele3
# <int> <chr> <chr> <chr>
# 1 1 AT CC AA
# 2 2 AT CC AA
# 3 3 AT CC AA
# 4 4 AT CC AA
# ...
I modified your data to closer match your description:
df <- data_frame(id = seq(1,20),
allele1a = rep("A", 20),
allele1b = c(rep("T", 10), rep("A", 10)),
allele2a = rep("C", 20),
allele2b = c(rep("C", 10), rep("G", 10)),
allele3a = rep("A", 20),
allele3b = c(rep("A", 10), rep("G", 10)))
Upvotes: 2
Reputation: 16881
Here's a tidyverse
way designed to scale fairly well. Instead of hard-coding that you want to pair columns 1 & 2, 3 & 4, and 5 & 6, I'm reshaping to long data to get a variable number, grouping those into pairs by dividing the variable number by 2, collapsing the letters in each pair, and reshaping back to wide. This way, you can do the same procedure on any even number of columns.
library(tidyverse)
...
Filtering for ID 1 to show a glimpse of this:
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
filter(id == 1)
#> # A tibble: 6 x 4
#> # Groups: id [1]
#> id var value pair
#> <int> <dbl> <chr> <dbl>
#> 1 1 1 A 1
#> 2 1 2 T 1
#> 3 1 3 C 2
#> 4 1 4 C 2
#> 5 1 5 A 3
#> 6 1 6 A 3
Then collapsing strings as a summarizing value for each combination of ID and pair:
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
group_by(id, pair) %>%
summarise(combined = paste(value, collapse = ""))
#> # A tibble: 60 x 3
#> # Groups: id [?]
#> id pair combined
#> <int> <dbl> <chr>
#> 1 1 1 AT
#> 2 1 2 CC
#> 3 1 3 AA
#> 4 2 1 AT
#> 5 2 2 CC
#> 6 2 3 AA
#> 7 3 1 AT
#> 8 3 2 CC
#> 9 3 3 AA
#> 10 4 1 AT
#> # ... with 50 more rows
And using spread
to get back into a wide format.
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
group_by(id, pair) %>%
summarise(combined = paste(value, collapse = "")) %>%
mutate(var_combo = paste0("var", pair)) %>%
select(-pair) %>%
spread(key = var_combo, value = combined) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: id [6]
#> id var1 var2 var3
#> <int> <chr> <chr> <chr>
#> 1 1 AT CC AA
#> 2 2 AT CC AA
#> 3 3 AT CC AA
#> 4 4 AT CC AA
#> 5 5 AT CC AA
#> 6 6 AT CC AA
Created on 2018-11-07 by the reprex package (v0.2.1)
Upvotes: 2
Reputation: 887971
We can create a loop to subset the columns along with the adjacent column, paste
it together with
do.call` and assign it as new columns to the new dataset
out <- df[1]
out[paste0("var", 1:3)] <- lapply(seq(2, ncol(df), 2),
function(i) do.call(paste0, df[i:(i+1)]))
Upvotes: 3
Reputation: 43
df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)
df2 <- data.frame(id = df[,1], var1 = paste(df[,2], df[,3], sep = ""),
var2 = paste(df[,4], df[,5], sep = ""),
var3 = paste(df[,6], df[,7], sep = ""))
Upvotes: 0