Reputation: 1088
I have a table named NE
that contains spliced RNA junctions:
#Chr start end ID . + GTEX-Q2AG-0126-SM-2HMLB
1: 13 20244503 20244980 13:20244503:20244980:clu_1587_NA . + 0.01122552
2: 13 20244503 20245346 13:20244503:20245346:clu_1587_NA . + 0.09151192
3: 13 20249124 20251864 13:20249124:20251864:clu_1588_NA . + 0.94156107
4: 13 20249793 20251864 13:20249793:20251864:clu_1588_NA . + -1.00545250
5: 13 20251963 20304379 13:20251963:20304379:clu_1589_NA . + -0.17101732
6: 13 20283739 20304379 13:20283739:20304379:clu_1589_NA . + 0.26907797
GTEX-N7MT-0011-R5a-SM-2I3G6 GTEX-PW2O-0526-SM-2I3DX GTEX-OHPK-0526-SM-2HMJB
1: -0.73425796 0.32721133 -0.05645774
2: 0.83044440 -0.08213476 0.23888779
3: -0.02207567 -1.68168241 1.69042151
4: 0.16780741 1.55309040 -1.83313242
5: -0.96313998 0.96385901 0.40292406
6: 1.00445387 -0.89044547 -0.24664686
I have another table named tissue_table
that contains such information:
Run Sample_Name body_site
1: SRR598484 GTEX-PW2O-0526-SM-2I3DX Lung
2: SRR598124 GTEX-NPJ8-0011-R4a-SM-2HML3 Brain - Amygdala
3: SRR599192 GTEX-N7MT-0011-R5a-SM-2I3G6 Brain - Caudate (basal ganglia)
4: SRR601925 GTEX-OHPK-0526-SM-2HMJB Lung
5: SRR601068 GTEX-Q2AG-0126-SM-2HMLB Skin - Sun Exposed (Lower leg)
6: SRR602598 GTEX-Q2AG-0011-R9A-SM-2HMJ6 Brain - Spinal cord (cervical c-1)
What I want to do is generate new tables from NE
based on tissue_table$body_site
; meaning, I want all rows for each column that match with each tissue type to be output as a file. For example, if GTEX-PW2O-0526-SM-2I3DX
and GTEX-OHPK-0526-SM-2HMJB
both match with 'Lung' in tissue_table$body_site
, I want to make a new table named like LungPhenotypes.txt
that looks like NE
(in that it has the columns #Chr
, start
, end
, and ID
) but contains only information sampled from the lung according to tissue_table
.
Here's the code I already have:
require("data.table")
require("R.utils")
args = commandArgs(trailingOnly=TRUE)
# args[1] is the leafcutter-generated phenotypes, args[2] is the tissue table
NE <- fread("NE_sQTL_perind.counts.gz.qqnorm_chr13")
tistab <- fread("tissue_table.txt")
# below takes the SRR IDs found in NE column headers, matches them to those found in the
# tissue table, and then changes them the GTEX sample ID
ind <- match(names(NE), tistab$Run)
names(NE) <- tistab$Sample_Name[ind]
# I guess now what I want to do is find the tissue that corresponds to each sample, and write
# to file the phenotypes or whatever
This is as far as I've gotten: I was able to change the column headers from their original titles (found in tissue_table$Run
) to those found in tissue_table$Sample_Name
. Otherwise, I don't even know how I would approach this problem. I'm sure it's easy, I just don't have enough familiarity with R to figure it out. Please let me know if I can clarify my issue.
Thank you.
EDIT: As requested, sample data:
> dput(head(NE, 10))
structure(list(`#Chr` = c(13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L), start = c(20244503L, 20244503L, 20249124L, 20249793L,
20251963L, 20283739L, 20803888L, 20803888L, 20803888L, 20804946L
), end = c(20244980L, 20245346L, 20251864L, 20251864L, 20304379L,
20304379L, 20804837L, 20805005L, 20805521L, 20805521L), ID = c("13:20244503:20244980:clu_1587_NA",
"13:20244503:20245346:clu_1587_NA", "13:20249124:20251864:clu_1588_NA",
"13:20249793:20251864:clu_1588_NA", "13:20251963:20304379:clu_1589_NA",
"13:20283739:20304379:clu_1589_NA", "13:20803888:20804837:clu_1590_NA",
"13:20803888:20805005:clu_1590_NA", "13:20803888:20805521:clu_1590_NA",
"13:20804946:20805521:clu_1590_NA"), . = c(".", ".", ".", ".",
".", ".", ".", ".", ".", "."), `+` = c("+", "+", "+", "+", "+",
"+", "+", "+", "+", "+"), `GTEX-Q2AG-0126-SM-2HMLB` = c(0.011225518662542,
0.0915119165352026, 0.941561071760354, -1.00545250297076, -0.171017320204747,
0.269077973405877, 2.26711789363315, -2.79253861638934, -0.732483471764967,
1.14279009708336), `GTEX-N7MT-0011-R5a-SM-2I3G6` = c(-0.734257957140664,
0.830444403564401, -0.0220756713815287, 0.167807411034554, -0.963139984748595,
1.00445387270491, -1.10454772191492, 0.872446843686367, -1.47490517820283,
-1.69929164403211), `GTEX-PW2O-0526-SM-2I3DX` = c(0.327211326349541,
-0.0821347590368987, -1.68168241366976, 1.55309040177528, 0.963859014491135,
-0.890445468211905, 0.126678936291309, 0.135493519239826, 0.126527048968275,
0.14416639182154), `GTEX-OHPK-0526-SM-2HMJB` = c(-0.0564577430398568,
0.238887789085513, 1.69042150820732, -1.83313242253239, 0.402924064571882,
-0.246646862056526, 0.091360610412634, 0.0993070943580591, 0.0912093063816043,
0.107562947531429), `GTEX-OXRL-0526-SM-2I3EZ` = c(0.0674782081460005,
0.158645645285045, 1.78738099166716, -1.88842809909606, 0.508954892349862,
-0.315945651353048, 0.119695085673777, 0.126982719708521, 0.119543328884529,
0.134581189175627), `GTEX-NPJ8-0011-R4a-SM-2HML3` = c(-0.437321982565311,
0.507754687799161, -0.150716233289565, 0.259715866743248, -1.25996459356113,
1.18247794999203, -1.16864600399772, 0.315945651353048, -0.328801349819712,
-1.23212889898317), `GTEX-Q2AG-0011-R9A-SM-2HMJ6` = c(-1.09599155148678,
0.925486945143163, -0.404068240058415, 0.465014413370245, 0.0869736283894613,
-0.0496678054798097, -0.151935513719884, -0.144927678233019,
2.26125148845566, 0.321494053014199), `GTEX-OIZH-0005-SM-2HMJN` = c(2.00669264539791,
-2.53139742741042, -1.17975392273608, 0.834065972618895, 1.47222782753213,
-1.91886672909768, 0.530513153906467, 0.969147205230478, 0.37852951989621,
1.19315578476064), `GTEX-Q2AG-0011-R4A-SM-2HMKA` = c(-0.375779231335771,
0.446292587332684, -0.710978014218879, 0.662901557390466, -1.30771089265708,
1.19715667858446, -0.747740836500599, 0.160171661041644, -0.487460943331342,
-0.816486102660646), `GTEX-OXRK-0926-SM-2HMKP` = c(0.536071533897896,
-0.32498668145395, 0.286146956081191, -0.058419751422225, 0.245249146196741,
-0.0231306654644876, 0.134125066194346, 0.141045986021489, 0.133973031406055,
0.147669008162181)), .Names = c("#Chr", "start", "end", "ID",
".", "+", "GTEX-Q2AG-0126-SM-2HMLB", "GTEX-N7MT-0011-R5a-SM-2I3G6",
"GTEX-PW2O-0526-SM-2I3DX", "GTEX-OHPK-0526-SM-2HMJB", "GTEX-OXRL-0526-SM-2I3EZ",
"GTEX-NPJ8-0011-R4a-SM-2HML3", "GTEX-Q2AG-0011-R9A-SM-2HMJ6",
"GTEX-OIZH-0005-SM-2HMJN", "GTEX-Q2AG-0011-R4A-SM-2HMKA", "GTEX-OXRK-0926-SM-2HMKP"
), class = c("data.table", "data.frame"), row.names = c(NA, -10L
), .internal.selfref = <pointer: 0x1fcb378>)
> dput(head(tistab, 10))
structure(list(Run = c("SRR598484", "SRR598124", "SRR599192",
"SRR601925", "SRR601068", "SRR602598", "SRR607586", "SRR608288",
"SRR600445", "SRR608344"), Sample_Name = c("GTEX-PW2O-0526-SM-2I3DX",
"GTEX-NPJ8-0011-R4a-SM-2HML3", "GTEX-N7MT-0011-R5a-SM-2I3G6",
"GTEX-OHPK-0526-SM-2HMJB", "GTEX-Q2AG-0126-SM-2HMLB", "GTEX-Q2AG-0011-R9A-SM-2HMJ6",
"GTEX-OXRL-0526-SM-2I3EZ", "GTEX-OXRK-0926-SM-2HMKP", "GTEX-Q2AG-0011-R4A-SM-2HMKA",
"GTEX-OIZH-0005-SM-2HMJN"), body_site = c("Lung", "Brain - Amygdala",
"Brain - Caudate (basal ganglia)", "Lung", "Skin - Sun Exposed (Lower leg)",
"Brain - Spinal cord (cervical c-1)", "Lung", "Lung", "Brain - Amygdala",
"Whole Blood")), .Names = c("Run", "Sample_Name", "body_site"
), class = c("data.table", "data.frame"), row.names = c(NA, -10L
), .internal.selfref = <pointer: 0x1fcb378>)
Upvotes: 0
Views: 39
Reputation: 28685
If you split
Sample_Name
by body_site
, you get a vector of Sample_Name
s corresponding to each body_site
. Then you just need to intersect
this with the names of NE
for each body_site
, and select the columns resulting from that intersection. The result is a named list of data tables. The names are the body_site
values.
library(data.table) #not really necessary, just using it here since you're already using it
sites <- with(tistab, split(Sample_Name, body_site))
keep <- c('#Chr', 'start', 'end', 'ID')
lapply(sites, function(x)
NE[, .SD, .SDcols = c(keep, intersect(names(NE), x))])
The lapply code uses a function defined in lapply
. This is sometimes known as using an "anonymous" function. For data tables, .SD
is a data table of all the columns, or all the columns except for the grouping columns if grouping is used, or those specified in .SDcols
if the .SDcols
argument is used. So I'm just using it to select specific columns.
With a regular data frame, you could just do NE[, c(keep, intersect(names(NE), x))]
, but due to the way data.table handles what's inside the brackets, this would give an odd result (try NE[, names(NE)[1:2]]
with a data.table, and then with a regular data frame, to see what I mean).
Printed output
# $`Brain - Amygdala`
# #Chr start end ID GTEX-NPJ8-0011-R4a-SM-2HML3
# 1: 13 20244503 20244980 13:20244503:20244980:clu_1587_NA -0.4373220
# 2: 13 20244503 20245346 13:20244503:20245346:clu_1587_NA 0.5077547
# 3: 13 20249124 20251864 13:20249124:20251864:clu_1588_NA -0.1507162
# 4: 13 20249793 20251864 13:20249793:20251864:clu_1588_NA 0.2597159
# 5: 13 20251963 20304379 13:20251963:20304379:clu_1589_NA -1.2599646
# 6: 13 20283739 20304379 13:20283739:20304379:clu_1589_NA 1.1824779
# 7: 13 20803888 20804837 13:20803888:20804837:clu_1590_NA -1.1686460
# 8: 13 20803888 20805005 13:20803888:20805005:clu_1590_NA 0.3159457
# 9: 13 20803888 20805521 13:20803888:20805521:clu_1590_NA -0.3288013
# 10: 13 20804946 20805521 13:20804946:20805521:clu_1590_NA -1.2321289
# GTEX-Q2AG-0011-R4A-SM-2HMKA
# 1: -0.3757792
# 2: 0.4462926
# 3: -0.7109780
# 4: 0.6629016
# 5: -1.3077109
# 6: 1.1971567
# 7: -0.7477408
# 8: 0.1601717
# 9: -0.4874609
# 10: -0.8164861
#
# $`Brain - Caudate (basal ganglia)`
# #Chr start end ID GTEX-N7MT-0011-R5a-SM-2I3G6
# 1: 13 20244503 20244980 13:20244503:20244980:clu_1587_NA -0.73425796
# 2: 13 20244503 20245346 13:20244503:20245346:clu_1587_NA 0.83044440
# 3: 13 20249124 20251864 13:20249124:20251864:clu_1588_NA -0.02207567
# 4: 13 20249793 20251864 13:20249793:20251864:clu_1588_NA 0.16780741
# 5: 13 20251963 20304379 13:20251963:20304379:clu_1589_NA -0.96313998
# 6: 13 20283739 20304379 13:20283739:20304379:clu_1589_NA 1.00445387
# 7: 13 20803888 20804837 13:20803888:20804837:clu_1590_NA -1.10454772
# 8: 13 20803888 20805005 13:20803888:20805005:clu_1590_NA 0.87244684
# 9: 13 20803888 20805521 13:20803888:20805521:clu_1590_NA -1.47490518
# 10: 13 20804946 20805521 13:20804946:20805521:clu_1590_NA -1.69929164
#
# $`Brain - Spinal cord (cervical c-1)`
# #Chr start end ID GTEX-Q2AG-0011-R9A-SM-2HMJ6
# 1: 13 20244503 20244980 13:20244503:20244980:clu_1587_NA -1.09599155
# 2: 13 20244503 20245346 13:20244503:20245346:clu_1587_NA 0.92548695
# 3: 13 20249124 20251864 13:20249124:20251864:clu_1588_NA -0.40406824
# 4: 13 20249793 20251864 13:20249793:20251864:clu_1588_NA 0.46501441
# 5: 13 20251963 20304379 13:20251963:20304379:clu_1589_NA 0.08697363
# 6: 13 20283739 20304379 13:20283739:20304379:clu_1589_NA -0.04966781
# 7: 13 20803888 20804837 13:20803888:20804837:clu_1590_NA -0.15193551
# 8: 13 20803888 20805005 13:20803888:20805005:clu_1590_NA -0.14492768
# 9: 13 20803888 20805521 13:20803888:20805521:clu_1590_NA 2.26125149
# 10: 13 20804946 20805521 13:20804946:20805521:clu_1590_NA 0.32149405
#
# $Lung
# #Chr start end ID GTEX-PW2O-0526-SM-2I3DX
# 1: 13 20244503 20244980 13:20244503:20244980:clu_1587_NA 0.32721133
# 2: 13 20244503 20245346 13:20244503:20245346:clu_1587_NA -0.08213476
# 3: 13 20249124 20251864 13:20249124:20251864:clu_1588_NA -1.68168241
# 4: 13 20249793 20251864 13:20249793:20251864:clu_1588_NA 1.55309040
# 5: 13 20251963 20304379 13:20251963:20304379:clu_1589_NA 0.96385901
# 6: 13 20283739 20304379 13:20283739:20304379:clu_1589_NA -0.89044547
# 7: 13 20803888 20804837 13:20803888:20804837:clu_1590_NA 0.12667894
# 8: 13 20803888 20805005 13:20803888:20805005:clu_1590_NA 0.13549352
# 9: 13 20803888 20805521 13:20803888:20805521:clu_1590_NA 0.12652705
# 10: 13 20804946 20805521 13:20804946:20805521:clu_1590_NA 0.14416639
# GTEX-OHPK-0526-SM-2HMJB GTEX-OXRL-0526-SM-2I3EZ GTEX-OXRK-0926-SM-2HMKP
# 1: -0.05645774 0.06747821 0.53607153
# 2: 0.23888779 0.15864565 -0.32498668
# 3: 1.69042151 1.78738099 0.28614696
# 4: -1.83313242 -1.88842810 -0.05841975
# 5: 0.40292406 0.50895489 0.24524915
# 6: -0.24664686 -0.31594565 -0.02313067
# 7: 0.09136061 0.11969509 0.13412507
# 8: 0.09930709 0.12698272 0.14104599
# 9: 0.09120931 0.11954333 0.13397303
# 10: 0.10756295 0.13458119 0.14766901
#
# $`Skin - Sun Exposed (Lower leg)`
# #Chr start end ID GTEX-Q2AG-0126-SM-2HMLB
# 1: 13 20244503 20244980 13:20244503:20244980:clu_1587_NA 0.01122552
# 2: 13 20244503 20245346 13:20244503:20245346:clu_1587_NA 0.09151192
# 3: 13 20249124 20251864 13:20249124:20251864:clu_1588_NA 0.94156107
# 4: 13 20249793 20251864 13:20249793:20251864:clu_1588_NA -1.00545250
# 5: 13 20251963 20304379 13:20251963:20304379:clu_1589_NA -0.17101732
# 6: 13 20283739 20304379 13:20283739:20304379:clu_1589_NA 0.26907797
# 7: 13 20803888 20804837 13:20803888:20804837:clu_1590_NA 2.26711789
# 8: 13 20803888 20805005 13:20803888:20805005:clu_1590_NA -2.79253862
# 9: 13 20803888 20805521 13:20803888:20805521:clu_1590_NA -0.73248347
# 10: 13 20804946 20805521 13:20804946:20805521:clu_1590_NA 1.14279010
#
# $`Whole Blood`
# #Chr start end ID GTEX-OIZH-0005-SM-2HMJN
# 1: 13 20244503 20244980 13:20244503:20244980:clu_1587_NA 2.0066926
# 2: 13 20244503 20245346 13:20244503:20245346:clu_1587_NA -2.5313974
# 3: 13 20249124 20251864 13:20249124:20251864:clu_1588_NA -1.1797539
# 4: 13 20249793 20251864 13:20249793:20251864:clu_1588_NA 0.8340660
# 5: 13 20251963 20304379 13:20251963:20304379:clu_1589_NA 1.4722278
# 6: 13 20283739 20304379 13:20283739:20304379:clu_1589_NA -1.9188667
# 7: 13 20803888 20804837 13:20803888:20804837:clu_1590_NA 0.5305132
# 8: 13 20803888 20805005 13:20803888:20805005:clu_1590_NA 0.9691472
# 9: 13 20803888 20805521 13:20803888:20805521:clu_1590_NA 0.3785295
# 10: 13 20804946 20805521 13:20804946:20805521:clu_1590_NA 1.1931558
Upvotes: 2