Combine column values with partially matching names into one semicolon-separated field in R

Question

I've looked through SO and have not found any advice that accurately explains what I am looking for.

I have a giant table. The first few columns have information about different expressed transcripts and the SNP which influences it. The remainder of the columns (of which there are around a thousand) are either information about an individual's tissue sample (with a column header such as GTEX.11DXX.1426.SM.5GIDU) or the individual's ID (GTEX.11DXX). The information under these columns contain either the number of transcripts expressed (e.g. 92) at that particular sequence and a binary value representing whether the allele that influences the expression of that transcript is Neandertal inherited or not (1 or 0), respectively.

What I want to do is consolidate the data underneath the binary columns with the data underneath the transcript number columns like so:

GTEX.11DXX.1426.SM.5GIDU
0;25
1;74
1;104
1;92
0;12
...
etc.

I want to accomplish this by partially matching the column name GTEX.11DXX with GTEX.11DXX.1426.SM.5GIDU, and then getting rid of binary columns so it's just the long column names.

I've tried using tidyverse's map(v, ~select_(ovary, ~matches(.))), and it kind of works, but that matches even if a one character is off, like so:

[[49]]
       GTEX.13X6H.1026.SM.5SIBE GTEX.13X6H GTEX.13X6I GTEX.13X6J GTEX.13X6K
    1:                       49          0          0          0          1
    2:                       44          0          0          0          1
    3:                        3          0          0          0          1
    4:                       23          0          0          0          1
    5:                       78          0          0          0          1
   ---                                                                     
80285:                       84          1          0          0          0
80286:                        1          1          0          0          0
80287:                        0          1          0          0          0
80288:                      152          1          0          0          0
80289:                      120          1          0          0          0

Again, I want to to work like this:

       GTEX.13X6H.1026.SM.5SIBE
    1:                     0;49
    2:                     0;44
    3:                      0;3
    4:                     0;23

Thank you

akraf · Accepted Answer

Your problems come because your data set is

...in wide format, meaning that data (individual names) is found in column headers. That makes it difficult to look up matching SNP <--> individual columns. See here for information about what is wide vs. long data
... actually two data sets, merged together. One is about transcripts vs. number expressed (the long headers) and the other data set is transcripts vs. neandertal or not.

Therefore, put the tissue names and individual names into their own column instead of using them as column names! This is described in the following:

Create one data set (I call it dataset1) that has only information on allele counts. Do that by removing unnessecary columns. Only the following columns should remain:
- a column with a unique transcript name
- the columns of the pattern GTEX.11DXX.1426.SM.5GIDU
Create another data set (-> dataset2) similarly to above, but it includes only:
- the column with the unique transcript name
- the columns of the pattern GTEX.11DXX

Use tidyr::gather or the newer tidyr::pivot* (that I have no experience with yet) to get dataset1 look like this

 transcript    tissue                    count
 A             GTEX.13X6H.1026.SM.5SIBE  49
 B             GTEX.13X6H.1026.SM.5SIBE  44
 C             GTEX.13X6H.1026.SM.5SIBE   3
 ...
 A             GTEX.13X6I.1026.SM.5SIBE  10
 B             GTEX.13X6I.1026.SM.5SIBE  11
 ...

and dataset2 like this:

 transcript    individual   isNeandertal
 A             GTEX.13X6H   0
 B             GTEX.13X6H   0
 ...
 A             GTEX.13X6I   0           
 B             GTEX.13X6I   0
 ...

In data set 1, get the individual name from the tissue name, making the data set look like this:

transcript    tissue                    individual  count
A             GTEX.13X6H.1026.SM.5SIBE  GTEX.13X6H  49
B             GTEX.13X6H.1026.SM.5SIBE  GTEX.13X6H  44
C             GTEX.13X6H.1026.SM.5SIBE  GTEX.13X6H   3
...
A             GTEX.13X6I.1026.SM.5SIBE  GTEX.13X6I  10
B             GTEX.13X6I.1026.SM.5SIBE  GTEX.13X6I  11
...

Now you can use dplyr::full_join(dataset1, dataset2, by = c("transcript", "individual")) to merge the allele counts and the neandertal status.

This works only if each combination of (transcript, individual) occurs only once per data set! If these columns have the same combination in multiple rows, full_join does not know which rows from dataset1 and dataset2 belong together and will output all combinations of them!

Check this by
stopifnot(!anyDuplicated(dataset1[, c("transcript", "individual")])) and
stopifnot(!anyDuplicated(dataset2[, c("transcript", "individual")]))(with your variable names, of course)

The result of this step should look like this:

    transcript    tissue                    individual  count  isNeandertal
    A             GTEX.13X6H.1026.SM.5SIBE  GTEX.13X6H  49     0
    B             GTEX.13X6H.1026.SM.5SIBE  GTEX.13X6H  44     0
    C             GTEX.13X6H.1026.SM.5SIBE  GTEX.13X6H   3     0
    ...
    A             GTEX.13X6I.1026.SM.5SIBE  GTEX.13X6I  10     0
    B             GTEX.13X6I.1026.SM.5SIBE  GTEX.13X6I  11     0
    ...

Now you can paste together the columns count and isNeandertal

Combine column values with partially matching names into one semicolon-separated field in R

Answers (1)

Related Questions