R: Split multiple rows into a list element based on pattern

Question

I'm trying to parse this .txt file in R: https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt

It's essentially a single-column data frame of roughly 2 million rows, with each entity described by multiple rows and delimited by rows containing the string "//".

Ideally, I could capture each entity, made up of multiple rows, as a list element by splitting at "//", but I'm not sure of the most efficient way to go about this.

Any help is much appreciated.

EDIT:

Here's a snippet of what I'm working with:

[87] "//"                                                                                                                                                                                             
 [88] "ID   #40a"                                                                                                                                                                                      
 [89] "AC   CVCL_IW91"                                                                                                                                                                                 
 [90] "DR   Wikidata; Q54422071"                                                                                                                                                                       
 [91] "RX   PubMed=28159921;"                                                                                                                                                                          
 [92] "CC   Characteristics: Established from parent cell line after two passages in the peritoneal cavity of C57BL/6 mice (PubMed=28159921)."                                                         
 [93] "CC   Transformant: ChEBI; CHEBI:46666; Crocidolite asbestos."                                                                                                                                   
 [94] "CC   Derived from metastatic site: Peritoneum."                                                                                                                                                 
 [95] "CC   Breed/subspecies: C57BL/6."                                                                                                                                                                
 [96] "DI   NCIt; C21619; Mouse mesothelioma"                                                                                                                                                          
 [97] "OX   NCBI_TaxID=10090; ! Mus musculus"                                                                                                                                                          
 [98] "HI   CVCL_IW90 ! 40"                                                                                                                                                                            
 [99] "SX   Male"                                                                                                                                                                                      
[100] "AG   1-2M"                                                                                                                                                                                      
[101] "CA   Cancer cell line"                                                                                                                                                                          
[102] "DT   Created: 15-05-17; Last updated: 02-07-20; Version: 3"                                                                                                                                     
[103] "//"                                                                                                                                                                                             
[104] "ID   #490"                                                                                                                                                                                      
[105] "AC   CVCL_B375"                                                                                                                                                                                 
[106] "SY   490; Mab 7; Mab7"                                                                                                                                                                          
[107] "DR   CLO; CLO_0001018"                                                                                                                                                                          
[108] "DR   ATCC; HB-12029"                                                                                                                                                                            
[109] "DR   Wikidata; Q54422073"                                                                                                                                                                       
[110] "RX   Patent=US5616470;"                                                                                                                                                                         
[111] "CC   Monoclonal antibody isotype: IgM, kappa."                                                                                                                                                  
[112] "CC   Monoclonal antibody target: Cronartium ribicola antigens."                                                                                                                                 
[113] "OX   NCBI_TaxID=10090; ! Mus musculus"                                                                                                                                                          
[114] "HI   CVCL_4032 ! P3X63Ag8.653"                                                                                                                                                                  
[115] "CA   Hybridoma"                                                                                                                                                                                 
[116] "DT   Created: 06-06-12; Last updated: 12-03-20; Version: 6"                                                                                                                                     
[117] "//"                                                                                                                                                                                             
[118] "ID   #822"                                                                                                                                                                                      
[119] "AC   CVCL_X345"                                                                                                                                                                                 
[120] "SY   822; Mab 13; Mab13"                                                                                                                                                                        
[121] "DR   ATCC; HB-12030"                                                                                                                                                                            
[122] "DR   Wikidata; Q54422076"                                                                                                                                                                       
[123] "RX   Patent=US5616470;"                                                                                                                                                                         
[124] "CC   Monoclonal antibody isotype: IgM, kappa."                                                                                                                                                  
[125] "CC   Monoclonal antibody target: Cronartium ribicola antigens."                                                                                                                                 
[126] "OX   NCBI_TaxID=10090; ! Mus musculus"                                                                                                                                                          
[127] "HI   CVCL_4032 ! P3X63Ag8.653"                                                                                                                                                                  
[128] "CA   Hybridoma"                                                                                                                                                                                 
[129] "DT   Created: 17-07-14; Last updated: 12-03-20; Version: 5"                                                                                                                                     
[130] "//"

As an added clarification, my goal is to search for a given accession (AC), e.g. CVCL_X345, and then extract age (AG) and sex (SX) for that accession if they are available.

user12728748 · Accepted Answer

Here is one solution using data.table.

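library(data.table)
# Read each line as one character column (sep=''); skip=54 skips the descriptive
# header at the top of the file (the count may change between releases).
dt <- fread("https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt",
            skip=54, header=FALSE, sep='')
# Split each line at its first run of blanks into code and content,
# then drop V1 and number the records by counting the "//" separator lines.
dt[, c("code", "content"):=tstrsplit(sub(" +", "@/@", V1), "@/@") ][, 
  `:=` (V1=NULL, ID=cumsum(code=="//")+1)]
dt <- dt[code!="//"]
# Join on the ID of the matching accession, then keep its sex and age rows.
dt[dt[content=="CVCL_IW91"], on="ID"][code %chin% c("SX", "AG")]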
#>    code content ID i.code i.content
#> 1:   SX    Male  3     AC CVCL_IW91
#> 2:   AG    1-2M  3     AC CVCL_IW91

# or get all of them:
dcast(dt[code %in% c("SX", "AG", "AC")][, .(code, content), by=ID], ID ~ ...,
      value.var="content")
#>             ID        AC              AG     SX
#>      1:      1 CVCL_E548 Age unspecified Female
#>      2:      2 CVCL_KA96               
#>      3:      3 CVCL_IW91            1-2M   Male
#>      4:      4 CVCL_B375               
#>      5:      5 CVCL_X345               
#>     ---                                        
#> 128802: 128802 CVCL_A6IX             29Y   Male
#> 128803: 128803 CVCL_ZB29             57Y Female
#> 128804: 128804 CVCL_ZB30             32Y Female
#> 128805: 128805 CVCL_A3ZF             26Y Female
#> 128806: 128806 CVCL_3449               Male

Created on 2021-06-01 by the reprex package (v2.0.0)

Edit: Brief explanation:

In essence, I first split each row at its first run of blanks. To do this, I replace that run with a separator that does not occur anywhere in the text (checked beforehand with grep), then use tstrsplit to split the single column V1 on this separator into two columns (code and content). Then I remove V1 and use cumsum to increment the identifier ID at each separator line (//), so that every record gets its own identifier.
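
The same steps can be tried on a small in-memory vector before running them on the full file. This is only a sketch: the toy lines are copied from the snippet in the question, and the names toy, hit and records are made up for illustration. The last line also shows one possible way to get the list-per-record form the question originally asked about.

```r
library(data.table)

# Toy input: two records in the same layout as the snippet in the question.
lines <- c(
  "ID   #40a",
  "AC   CVCL_IW91",
  "SX   Male",
  "AG   1-2M",
  "//",
  "ID   #490",
  "AC   CVCL_B375",
  "CA   Hybridoma",
  "//"
)
toy <- data.table(V1 = lines)

# Replace the first run of blanks with a marker, then split on that marker.
# "//" lines contain no blanks, so their content is filled with NA.
toy[, c("code", "content") := tstrsplit(sub(" +", "@/@", V1), "@/@", fixed = TRUE)]

# The running count of "//" lines (plus one) gives all rows of a record
# the same ID; the separator rows themselves are dropped afterwards.
toy[, ID := cumsum(code == "//") + 1L]
toy <- toy[code != "//", .(ID, code, content)]

# Find the record ID of one accession, then keep only its sex and age rows.
hit <- toy[toy[code == "AC" & content == "CVCL_IW91", .(ID)], on = "ID"][
  code %chin% c("SX", "AG")]

# Alternative shape: a list with one data.table per record.
records <- split(toy, by = "ID")
```

Restricting the lookup to code == "AC" is a small deviation from the answer's code, which matches on content alone; it only guards against the accession string appearing in some other field.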
