CelineDion
CelineDion

Reputation: 1088

How do I split a table into several new tables based on whether the column header matches with information found in another table? (R)

I have a table named NE that contains spliced RNA junctions:

   #Chr    start      end                               ID . + GTEX-Q2AG-0126-SM-2HMLB
1:   13 20244503 20244980 13:20244503:20244980:clu_1587_NA . +              0.01122552
2:   13 20244503 20245346 13:20244503:20245346:clu_1587_NA . +              0.09151192
3:   13 20249124 20251864 13:20249124:20251864:clu_1588_NA . +              0.94156107
4:   13 20249793 20251864 13:20249793:20251864:clu_1588_NA . +             -1.00545250
5:   13 20251963 20304379 13:20251963:20304379:clu_1589_NA . +             -0.17101732
6:   13 20283739 20304379 13:20283739:20304379:clu_1589_NA . +              0.26907797
   GTEX-N7MT-0011-R5a-SM-2I3G6 GTEX-PW2O-0526-SM-2I3DX GTEX-OHPK-0526-SM-2HMJB
1:                 -0.73425796              0.32721133             -0.05645774
2:                  0.83044440             -0.08213476              0.23888779
3:                 -0.02207567             -1.68168241              1.69042151
4:                  0.16780741              1.55309040             -1.83313242
5:                 -0.96313998              0.96385901              0.40292406
6:                  1.00445387             -0.89044547             -0.24664686

I have another table named tissue_table that contains such information:

         Run                 Sample_Name                          body_site
1: SRR598484     GTEX-PW2O-0526-SM-2I3DX                               Lung
2: SRR598124 GTEX-NPJ8-0011-R4a-SM-2HML3                   Brain - Amygdala
3: SRR599192 GTEX-N7MT-0011-R5a-SM-2I3G6    Brain - Caudate (basal ganglia)
4: SRR601925     GTEX-OHPK-0526-SM-2HMJB                               Lung
5: SRR601068     GTEX-Q2AG-0126-SM-2HMLB     Skin - Sun Exposed (Lower leg)
6: SRR602598 GTEX-Q2AG-0011-R9A-SM-2HMJ6 Brain - Spinal cord (cervical c-1)

What I want to do is generate new tables from NE based on tissue_table$body_site; meaning, I want all rows for each column that match with each tissue type to be output as a file. For example, if GTEX-PW2O-0526-SM-2I3DX and GTEX-OHPK-0526-SM-2HMJB both match with 'Lung' in tissue_table$body_site, I want to make a new table named like LungPhenotypes.txtthat looks like NE (in that it has the columns #Chr, start, end, and ID) but contains only information sampled from the lung according to tissue_table.

Here's the code I already have:

require("data.table")
require("R.utils")
args = commandArgs(trailingOnly=TRUE)
# args[1] is the leafcutter-generated phenotypes, args[2] is the tissue table
NE <- fread("NE_sQTL_perind.counts.gz.qqnorm_chr13")

tistab <- fread("tissue_table.txt")

# below takes the SRR IDs found in NE column headers, matches them to those found in the
# tissue table, and then changes them the GTEX sample ID
ind <- match(names(NE), tistab$Run)
names(NE) <- tistab$Sample_Name[ind]

# I guess now what I want to do is find the tissue that corresponds to each sample, and write
# to file the phenotypes or whatever

This is as far as I've gotten: I was able to change the column headers from their original titles (found in tissue_table$Run) to those found in tissue_table$Sample_Name. Otherwise, I don't even know how I would approach this problem. I'm sure it's easy, I just don't have enough familiarity with R to figure it out. Please let me know if I can clarify my issue.

Thank you.

EDIT: As requested, sample data:

> dput(head(NE, 10))

structure(list(`#Chr` = c(13L, 13L, 13L, 13L, 13L, 13L, 13L, 
13L, 13L, 13L), start = c(20244503L, 20244503L, 20249124L, 20249793L, 
20251963L, 20283739L, 20803888L, 20803888L, 20803888L, 20804946L
), end = c(20244980L, 20245346L, 20251864L, 20251864L, 20304379L, 
20304379L, 20804837L, 20805005L, 20805521L, 20805521L), ID = c("13:20244503:20244980:clu_1587_NA", 
"13:20244503:20245346:clu_1587_NA", "13:20249124:20251864:clu_1588_NA", 
"13:20249793:20251864:clu_1588_NA", "13:20251963:20304379:clu_1589_NA", 
"13:20283739:20304379:clu_1589_NA", "13:20803888:20804837:clu_1590_NA", 
"13:20803888:20805005:clu_1590_NA", "13:20803888:20805521:clu_1590_NA", 
"13:20804946:20805521:clu_1590_NA"), . = c(".", ".", ".", ".", 
".", ".", ".", ".", ".", "."), `+` = c("+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+"), `GTEX-Q2AG-0126-SM-2HMLB` = c(0.011225518662542, 
0.0915119165352026, 0.941561071760354, -1.00545250297076, -0.171017320204747, 
0.269077973405877, 2.26711789363315, -2.79253861638934, -0.732483471764967, 
1.14279009708336), `GTEX-N7MT-0011-R5a-SM-2I3G6` = c(-0.734257957140664, 
0.830444403564401, -0.0220756713815287, 0.167807411034554, -0.963139984748595, 
1.00445387270491, -1.10454772191492, 0.872446843686367, -1.47490517820283, 
-1.69929164403211), `GTEX-PW2O-0526-SM-2I3DX` = c(0.327211326349541, 
-0.0821347590368987, -1.68168241366976, 1.55309040177528, 0.963859014491135, 
-0.890445468211905, 0.126678936291309, 0.135493519239826, 0.126527048968275, 
0.14416639182154), `GTEX-OHPK-0526-SM-2HMJB` = c(-0.0564577430398568, 
0.238887789085513, 1.69042150820732, -1.83313242253239, 0.402924064571882, 
-0.246646862056526, 0.091360610412634, 0.0993070943580591, 0.0912093063816043, 
0.107562947531429), `GTEX-OXRL-0526-SM-2I3EZ` = c(0.0674782081460005, 
0.158645645285045, 1.78738099166716, -1.88842809909606, 0.508954892349862, 
-0.315945651353048, 0.119695085673777, 0.126982719708521, 0.119543328884529, 
0.134581189175627), `GTEX-NPJ8-0011-R4a-SM-2HML3` = c(-0.437321982565311, 
0.507754687799161, -0.150716233289565, 0.259715866743248, -1.25996459356113, 
1.18247794999203, -1.16864600399772, 0.315945651353048, -0.328801349819712, 
-1.23212889898317), `GTEX-Q2AG-0011-R9A-SM-2HMJ6` = c(-1.09599155148678, 
0.925486945143163, -0.404068240058415, 0.465014413370245, 0.0869736283894613, 
-0.0496678054798097, -0.151935513719884, -0.144927678233019, 
2.26125148845566, 0.321494053014199), `GTEX-OIZH-0005-SM-2HMJN` = c(2.00669264539791, 
-2.53139742741042, -1.17975392273608, 0.834065972618895, 1.47222782753213, 
-1.91886672909768, 0.530513153906467, 0.969147205230478, 0.37852951989621, 
1.19315578476064), `GTEX-Q2AG-0011-R4A-SM-2HMKA` = c(-0.375779231335771, 
0.446292587332684, -0.710978014218879, 0.662901557390466, -1.30771089265708, 
1.19715667858446, -0.747740836500599, 0.160171661041644, -0.487460943331342, 
-0.816486102660646), `GTEX-OXRK-0926-SM-2HMKP` = c(0.536071533897896, 
-0.32498668145395, 0.286146956081191, -0.058419751422225, 0.245249146196741, 
-0.0231306654644876, 0.134125066194346, 0.141045986021489, 0.133973031406055, 
0.147669008162181)), .Names = c("#Chr", "start", "end", "ID", 
".", "+", "GTEX-Q2AG-0126-SM-2HMLB", "GTEX-N7MT-0011-R5a-SM-2I3G6", 
"GTEX-PW2O-0526-SM-2I3DX", "GTEX-OHPK-0526-SM-2HMJB", "GTEX-OXRL-0526-SM-2I3EZ", 
"GTEX-NPJ8-0011-R4a-SM-2HML3", "GTEX-Q2AG-0011-R9A-SM-2HMJ6", 
"GTEX-OIZH-0005-SM-2HMJN", "GTEX-Q2AG-0011-R4A-SM-2HMKA", "GTEX-OXRK-0926-SM-2HMKP"
), class = c("data.table", "data.frame"), row.names = c(NA, -10L
), .internal.selfref = <pointer: 0x1fcb378>)

> dput(head(tistab, 10))

structure(list(Run = c("SRR598484", "SRR598124", "SRR599192", 
"SRR601925", "SRR601068", "SRR602598", "SRR607586", "SRR608288", 
"SRR600445", "SRR608344"), Sample_Name = c("GTEX-PW2O-0526-SM-2I3DX", 
"GTEX-NPJ8-0011-R4a-SM-2HML3", "GTEX-N7MT-0011-R5a-SM-2I3G6", 
"GTEX-OHPK-0526-SM-2HMJB", "GTEX-Q2AG-0126-SM-2HMLB", "GTEX-Q2AG-0011-R9A-SM-2HMJ6", 
"GTEX-OXRL-0526-SM-2I3EZ", "GTEX-OXRK-0926-SM-2HMKP", "GTEX-Q2AG-0011-R4A-SM-2HMKA", 
"GTEX-OIZH-0005-SM-2HMJN"), body_site = c("Lung", "Brain - Amygdala", 
"Brain - Caudate (basal ganglia)", "Lung", "Skin - Sun Exposed (Lower leg)", 
"Brain - Spinal cord (cervical c-1)", "Lung", "Lung", "Brain - Amygdala", 
"Whole Blood")), .Names = c("Run", "Sample_Name", "body_site"
), class = c("data.table", "data.frame"), row.names = c(NA, -10L
), .internal.selfref = <pointer: 0x1fcb378>)

Upvotes: 0

Views: 39

Answers (1)

IceCreamToucan
IceCreamToucan

Reputation: 28685

If you split Sample_Name by body_site, you get a vector of Sample_Names corresponding to each body_site. Then you just need to intersect this with the names of NE for each body_site, and select the columns resulting from that intersection. The result is a named list of data tables. The names are the body_site values.

library(data.table) #not really necessary, just using it here since you're already using it

sites <- with(tistab, split(Sample_Name, body_site))

keep <- c('#Chr', 'start', 'end', 'ID')

lapply(sites, function(x) 
  NE[, .SD, .SDcols = c(keep, intersect(names(NE), x))])

The lapply code uses a function defined in lapply. This is sometimes known as using an "anonymous" function. For data tables, .SD is a data table of all the columns, or all the columns except for the grouping columns if grouping is used, or those specified in .SDcols if the .SDcols argument is used. So I'm just using it to select specific columns.

With a regular data frame, you could just do NE[, c(keep, intersect(names(NE), x))], but due to the way data.table handles what's inside the brackets, this would give an odd result (try NE[, names(NE)[1:2]] with a data.table, and then with a regular data frame, to see what I mean).

Printed output

# $`Brain - Amygdala`
#     #Chr    start      end                               ID GTEX-NPJ8-0011-R4a-SM-2HML3
#  1:   13 20244503 20244980 13:20244503:20244980:clu_1587_NA                  -0.4373220
#  2:   13 20244503 20245346 13:20244503:20245346:clu_1587_NA                   0.5077547
#  3:   13 20249124 20251864 13:20249124:20251864:clu_1588_NA                  -0.1507162
#  4:   13 20249793 20251864 13:20249793:20251864:clu_1588_NA                   0.2597159
#  5:   13 20251963 20304379 13:20251963:20304379:clu_1589_NA                  -1.2599646
#  6:   13 20283739 20304379 13:20283739:20304379:clu_1589_NA                   1.1824779
#  7:   13 20803888 20804837 13:20803888:20804837:clu_1590_NA                  -1.1686460
#  8:   13 20803888 20805005 13:20803888:20805005:clu_1590_NA                   0.3159457
#  9:   13 20803888 20805521 13:20803888:20805521:clu_1590_NA                  -0.3288013
# 10:   13 20804946 20805521 13:20804946:20805521:clu_1590_NA                  -1.2321289
#     GTEX-Q2AG-0011-R4A-SM-2HMKA
#  1:                  -0.3757792
#  2:                   0.4462926
#  3:                  -0.7109780
#  4:                   0.6629016
#  5:                  -1.3077109
#  6:                   1.1971567
#  7:                  -0.7477408
#  8:                   0.1601717
#  9:                  -0.4874609
# 10:                  -0.8164861
# 
# $`Brain - Caudate (basal ganglia)`
#     #Chr    start      end                               ID GTEX-N7MT-0011-R5a-SM-2I3G6
#  1:   13 20244503 20244980 13:20244503:20244980:clu_1587_NA                 -0.73425796
#  2:   13 20244503 20245346 13:20244503:20245346:clu_1587_NA                  0.83044440
#  3:   13 20249124 20251864 13:20249124:20251864:clu_1588_NA                 -0.02207567
#  4:   13 20249793 20251864 13:20249793:20251864:clu_1588_NA                  0.16780741
#  5:   13 20251963 20304379 13:20251963:20304379:clu_1589_NA                 -0.96313998
#  6:   13 20283739 20304379 13:20283739:20304379:clu_1589_NA                  1.00445387
#  7:   13 20803888 20804837 13:20803888:20804837:clu_1590_NA                 -1.10454772
#  8:   13 20803888 20805005 13:20803888:20805005:clu_1590_NA                  0.87244684
#  9:   13 20803888 20805521 13:20803888:20805521:clu_1590_NA                 -1.47490518
# 10:   13 20804946 20805521 13:20804946:20805521:clu_1590_NA                 -1.69929164
# 
# $`Brain - Spinal cord (cervical c-1)`
#     #Chr    start      end                               ID GTEX-Q2AG-0011-R9A-SM-2HMJ6
#  1:   13 20244503 20244980 13:20244503:20244980:clu_1587_NA                 -1.09599155
#  2:   13 20244503 20245346 13:20244503:20245346:clu_1587_NA                  0.92548695
#  3:   13 20249124 20251864 13:20249124:20251864:clu_1588_NA                 -0.40406824
#  4:   13 20249793 20251864 13:20249793:20251864:clu_1588_NA                  0.46501441
#  5:   13 20251963 20304379 13:20251963:20304379:clu_1589_NA                  0.08697363
#  6:   13 20283739 20304379 13:20283739:20304379:clu_1589_NA                 -0.04966781
#  7:   13 20803888 20804837 13:20803888:20804837:clu_1590_NA                 -0.15193551
#  8:   13 20803888 20805005 13:20803888:20805005:clu_1590_NA                 -0.14492768
#  9:   13 20803888 20805521 13:20803888:20805521:clu_1590_NA                  2.26125149
# 10:   13 20804946 20805521 13:20804946:20805521:clu_1590_NA                  0.32149405
# 
# $Lung
#     #Chr    start      end                               ID GTEX-PW2O-0526-SM-2I3DX
#  1:   13 20244503 20244980 13:20244503:20244980:clu_1587_NA              0.32721133
#  2:   13 20244503 20245346 13:20244503:20245346:clu_1587_NA             -0.08213476
#  3:   13 20249124 20251864 13:20249124:20251864:clu_1588_NA             -1.68168241
#  4:   13 20249793 20251864 13:20249793:20251864:clu_1588_NA              1.55309040
#  5:   13 20251963 20304379 13:20251963:20304379:clu_1589_NA              0.96385901
#  6:   13 20283739 20304379 13:20283739:20304379:clu_1589_NA             -0.89044547
#  7:   13 20803888 20804837 13:20803888:20804837:clu_1590_NA              0.12667894
#  8:   13 20803888 20805005 13:20803888:20805005:clu_1590_NA              0.13549352
#  9:   13 20803888 20805521 13:20803888:20805521:clu_1590_NA              0.12652705
# 10:   13 20804946 20805521 13:20804946:20805521:clu_1590_NA              0.14416639
#     GTEX-OHPK-0526-SM-2HMJB GTEX-OXRL-0526-SM-2I3EZ GTEX-OXRK-0926-SM-2HMKP
#  1:             -0.05645774              0.06747821              0.53607153
#  2:              0.23888779              0.15864565             -0.32498668
#  3:              1.69042151              1.78738099              0.28614696
#  4:             -1.83313242             -1.88842810             -0.05841975
#  5:              0.40292406              0.50895489              0.24524915
#  6:             -0.24664686             -0.31594565             -0.02313067
#  7:              0.09136061              0.11969509              0.13412507
#  8:              0.09930709              0.12698272              0.14104599
#  9:              0.09120931              0.11954333              0.13397303
# 10:              0.10756295              0.13458119              0.14766901
# 
# $`Skin - Sun Exposed (Lower leg)`
#     #Chr    start      end                               ID GTEX-Q2AG-0126-SM-2HMLB
#  1:   13 20244503 20244980 13:20244503:20244980:clu_1587_NA              0.01122552
#  2:   13 20244503 20245346 13:20244503:20245346:clu_1587_NA              0.09151192
#  3:   13 20249124 20251864 13:20249124:20251864:clu_1588_NA              0.94156107
#  4:   13 20249793 20251864 13:20249793:20251864:clu_1588_NA             -1.00545250
#  5:   13 20251963 20304379 13:20251963:20304379:clu_1589_NA             -0.17101732
#  6:   13 20283739 20304379 13:20283739:20304379:clu_1589_NA              0.26907797
#  7:   13 20803888 20804837 13:20803888:20804837:clu_1590_NA              2.26711789
#  8:   13 20803888 20805005 13:20803888:20805005:clu_1590_NA             -2.79253862
#  9:   13 20803888 20805521 13:20803888:20805521:clu_1590_NA             -0.73248347
# 10:   13 20804946 20805521 13:20804946:20805521:clu_1590_NA              1.14279010
# 
# $`Whole Blood`
#     #Chr    start      end                               ID GTEX-OIZH-0005-SM-2HMJN
#  1:   13 20244503 20244980 13:20244503:20244980:clu_1587_NA               2.0066926
#  2:   13 20244503 20245346 13:20244503:20245346:clu_1587_NA              -2.5313974
#  3:   13 20249124 20251864 13:20249124:20251864:clu_1588_NA              -1.1797539
#  4:   13 20249793 20251864 13:20249793:20251864:clu_1588_NA               0.8340660
#  5:   13 20251963 20304379 13:20251963:20304379:clu_1589_NA               1.4722278
#  6:   13 20283739 20304379 13:20283739:20304379:clu_1589_NA              -1.9188667
#  7:   13 20803888 20804837 13:20803888:20804837:clu_1590_NA               0.5305132
#  8:   13 20803888 20805005 13:20803888:20805005:clu_1590_NA               0.9691472
#  9:   13 20803888 20805521 13:20803888:20805521:clu_1590_NA               0.3785295
# 10:   13 20804946 20805521 13:20804946:20805521:clu_1590_NA               1.1931558

Upvotes: 2

Related Questions