Reputation: 1088
I've looked through SO and have not found any advice that accurately explains what I am looking for.
I have a giant table. The first few columns have information about different expressed transcripts and the SNP which influences it. The remainder of the columns (of which there are around a thousand) are either information about an individual's tissue sample (with a column header such as GTEX.11DXX.1426.SM.5GIDU
) or the individual's ID (GTEX.11DXX
). The information under these columns contain either the number of transcripts expressed (e.g. 92
) at that particular sequence and a binary value representing whether the allele that influences the expression of that transcript is Neandertal inherited or not (1 or 0
), respectively.
What I want to do is consolidate the data underneath the binary columns with the data underneath the transcript number columns like so:
GTEX.11DXX.1426.SM.5GIDU
0;25
1;74
1;104
1;92
0;12
...
etc.
I want to accomplish this by partially matching the column name GTEX.11DXX
with GTEX.11DXX.1426.SM.5GIDU
, and then getting rid of binary columns so it's just the long column names.
I've tried using tidyverse
's map(v, ~select_(ovary, ~matches(.)))
, and it kind of works, but that matches even if a one character is off, like so:
[[49]]
GTEX.13X6H.1026.SM.5SIBE GTEX.13X6H GTEX.13X6I GTEX.13X6J GTEX.13X6K
1: 49 0 0 0 1
2: 44 0 0 0 1
3: 3 0 0 0 1
4: 23 0 0 0 1
5: 78 0 0 0 1
---
80285: 84 1 0 0 0
80286: 1 1 0 0 0
80287: 0 1 0 0 0
80288: 152 1 0 0 0
80289: 120 1 0 0 0
Again, I want to to work like this:
GTEX.13X6H.1026.SM.5SIBE
1: 0;49
2: 0;44
3: 0;3
4: 0;23
Thank you
Upvotes: 0
Views: 44
Reputation: 3235
Your problems come because your data set is
Therefore, put the tissue names and individual names into their own column instead of using them as column names! This is described in the following:
Create one data set (I call it dataset1
) that has only information on allele counts. Do that by removing unnessecary columns. Only the following columns should remain:
GTEX.11DXX.1426.SM.5GIDU
Create another data set (-> dataset2
) similarly to above, but it includes only:
GTEX.11DXX
Use tidyr::gather
or the newer tidyr::pivot*
(that I have no experience with yet) to get dataset1
look like this
transcript tissue count
A GTEX.13X6H.1026.SM.5SIBE 49
B GTEX.13X6H.1026.SM.5SIBE 44
C GTEX.13X6H.1026.SM.5SIBE 3
...
A GTEX.13X6I.1026.SM.5SIBE 10
B GTEX.13X6I.1026.SM.5SIBE 11
...
and dataset2
like this:
transcript individual isNeandertal
A GTEX.13X6H 0
B GTEX.13X6H 0
...
A GTEX.13X6I 0
B GTEX.13X6I 0
...
In data set 1, get the individual name from the tissue name, making the data set look like this:
transcript tissue individual count
A GTEX.13X6H.1026.SM.5SIBE GTEX.13X6H 49
B GTEX.13X6H.1026.SM.5SIBE GTEX.13X6H 44
C GTEX.13X6H.1026.SM.5SIBE GTEX.13X6H 3
...
A GTEX.13X6I.1026.SM.5SIBE GTEX.13X6I 10
B GTEX.13X6I.1026.SM.5SIBE GTEX.13X6I 11
...
Now you can use dplyr::full_join(dataset1, dataset2, by = c("transcript", "individual"))
to merge the allele counts and the neandertal status.
This works only if each combination of (transcript, individual) occurs only once per data set! If these columns have the same combination in multiple rows, full_join
does not know which rows from dataset1
and dataset2
belong together and will output all combinations of them!
Check this by
stopifnot(!anyDuplicated(dataset1[, c("transcript", "individual")]))
and
stopifnot(!anyDuplicated(dataset2[, c("transcript", "individual")]))
(with your
variable names, of course)
The result of this step should look like this:
transcript tissue individual count isNeandertal
A GTEX.13X6H.1026.SM.5SIBE GTEX.13X6H 49 0
B GTEX.13X6H.1026.SM.5SIBE GTEX.13X6H 44 0
C GTEX.13X6H.1026.SM.5SIBE GTEX.13X6H 3 0
...
A GTEX.13X6I.1026.SM.5SIBE GTEX.13X6I 10 0
B GTEX.13X6I.1026.SM.5SIBE GTEX.13X6I 11 0
...
count
and isNeandertal
Upvotes: 1