Reputation: 195
I'm trying to parse this .txt file in R: https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt
It's essentially a single-column data frame of roughly 2 million rows, with each entity being described by multiple rows and bookended by rows containing the string "//".
Ideally, I could capture each entity, made up of multiple rows, as a list element by splitting at "//", but I'm not sure of the most efficient way to go about this.
Any help is much appreciated.
EDIT:
Here's a snippet of what I'm working with:
[87] "//"
[88] "ID #40a"
[89] "AC CVCL_IW91"
[90] "DR Wikidata; Q54422071"
[91] "RX PubMed=28159921;"
[92] "CC Characteristics: Established from parent cell line after two passages in the peritoneal cavity of C57BL/6 mice (PubMed=28159921)."
[93] "CC Transformant: ChEBI; CHEBI:46666; Crocidolite asbestos."
[94] "CC Derived from metastatic site: Peritoneum."
[95] "CC Breed/subspecies: C57BL/6."
[96] "DI NCIt; C21619; Mouse mesothelioma"
[97] "OX NCBI_TaxID=10090; ! Mus musculus"
[98] "HI CVCL_IW90 ! 40"
[99] "SX Male"
[100] "AG 1-2M"
[101] "CA Cancer cell line"
[102] "DT Created: 15-05-17; Last updated: 02-07-20; Version: 3"
[103] "//"
[104] "ID #490"
[105] "AC CVCL_B375"
[106] "SY 490; Mab 7; Mab7"
[107] "DR CLO; CLO_0001018"
[108] "DR ATCC; HB-12029"
[109] "DR Wikidata; Q54422073"
[110] "RX Patent=US5616470;"
[111] "CC Monoclonal antibody isotype: IgM, kappa."
[112] "CC Monoclonal antibody target: Cronartium ribicola antigens."
[113] "OX NCBI_TaxID=10090; ! Mus musculus"
[114] "HI CVCL_4032 ! P3X63Ag8.653"
[115] "CA Hybridoma"
[116] "DT Created: 06-06-12; Last updated: 12-03-20; Version: 6"
[117] "//"
[118] "ID #822"
[119] "AC CVCL_X345"
[120] "SY 822; Mab 13; Mab13"
[121] "DR ATCC; HB-12030"
[122] "DR Wikidata; Q54422076"
[123] "RX Patent=US5616470;"
[124] "CC Monoclonal antibody isotype: IgM, kappa."
[125] "CC Monoclonal antibody target: Cronartium ribicola antigens."
[126] "OX NCBI_TaxID=10090; ! Mus musculus"
[127] "HI CVCL_4032 ! P3X63Ag8.653"
[128] "CA Hybridoma"
[129] "DT Created: 17-07-14; Last updated: 12-03-20; Version: 5"
[130] "//"
As an added clarification, my goal is to search for a given accession (AC), e.g. CVCL_X345, and then extract age (AG) and sex (SX) for that accession if they are available.
Upvotes: 0
Views: 432
Reputation: 8506
Here is one solution using data.table:
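library(data.table)

# Read every line as a single column (V1), skipping the first 54 lines.
dt <- fread("https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt",
            skip = 54, header = FALSE, sep = "")
# Split each line at its first run of blanks into a field code and its content,
# then drop V1 and number the records by counting the "//" separator lines.
dt[, c("code", "content") := tstrsplit(sub(" +", "@/@", V1), "@/@")][,
   `:=`(V1 = NULL, ID = cumsum(code == "//") + 1)]
dt <- dt[code != "//"]
# Join on the record ID of one accession and keep its sex (SX) and age (AG) rows.
dt[dt[content == "CVCL_IW91"], on = "ID"][code %chin% c("SX", "AG")]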
#> code content ID i.code i.content
#> 1: SX Male 3 AC CVCL_IW91
#> 2: AG 1-2M 3 AC CVCL_IW91
# or get all of them:
dcast(dt[code %in% c("SX", "AG", "AC")][, .(code, content), by = ID], ID ~ ...,
      value.var = "content")
#> ID AC AG SX
#> 1: 1 CVCL_E548 Age unspecified Female
#> 2: 2 CVCL_KA96 <NA> <NA>
#> 3: 3 CVCL_IW91 1-2M Male
#> 4: 4 CVCL_B375 <NA> <NA>
#> 5: 5 CVCL_X345 <NA> <NA>
#> ---
#> 128802: 128802 CVCL_A6IX 29Y Male
#> 128803: 128803 CVCL_ZB29 57Y Female
#> 128804: 128804 CVCL_ZB30 32Y Female
#> 128805: 128805 CVCL_A3ZF 26Y Female
#> 128806: 128806 CVCL_3449 <NA> Male
Created on 2021-06-01 by the reprex package (v2.0.0)
Edit: Brief explanation:
In essence, I first split each row at its first run of blanks. To do this, I replace that run with a separator that does not occur anywhere in the text (checked beforehand with grep), then use tstrsplit to split the column V1 on this separator into two columns, code and content. Then I remove V1 and use cumsum to increment the identifier ID at each separator line (//), so that every record is labelled with its own identifier.
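The same steps can be tried without downloading the file. The following is only a minimal sketch on a small hand-made sample: the sample lines, the sep_token variable and the lookup() helper are assumptions introduced for illustration and are not part of the original solution.

```r
library(data.table)

# Hypothetical sample mimicking the file layout: two records, each ended by "//".
lines <- c(
  "ID   #40a",
  "AC   CVCL_IW91",
  "SX   Male",
  "AG   1-2M",
  "//",
  "ID   #490",
  "AC   CVCL_B375",
  "CA   Hybridoma",
  "//"
)

# Placeholder separator; confirm first that it does not occur in the text.
sep_token <- "@/@"
stopifnot(!any(grepl(sep_token, lines, fixed = TRUE)))

dt <- data.table(V1 = lines)

# Replace only the first run of blanks, then split on the placeholder.
# Lines without a blank (the "//" lines) get NA in the content column.
dt[, c("code", "content") :=
     tstrsplit(sub(" +", sep_token, V1), sep_token, fixed = TRUE)]

# Drop the raw line and number the records: the counter goes up at every "//".
dt[, `:=`(V1 = NULL, ID = cumsum(code == "//") + 1L)]
dt <- dt[code != "//"]

# Hypothetical helper: return the SX and AG rows of the record whose
# AC line matches the given accession (zero rows if none are present).
lookup <- function(dt, accession) {
  rec <- dt[code == "AC" & content == accession, ID]
  dt[ID %in% rec & code %chin% c("SX", "AG"), .(code, content)]
}

res <- lookup(dt, "CVCL_IW91")
print(res)
```

Calling lookup(dt, "CVCL_B375") on this sample returns an empty table, because that record has no SX or AG line. Matching on code == "AC" as well as on the content is a small extra safeguard, in case an accession also appears in the content of another field.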
Upvotes: 2