Kvothe

Reputation: 79

Opening txt file in record format in R

I'm looking to read a txt file in record format in R as a dataframe with each row corresponding to a single record. Records are of varying lengths. Any idea how I do this?

This is the start of the file (the header, the first record, and the beginning of the second):

# C. elegans orthologs      
# WormBase version: WS241       
# Generated:        
# File is in record format with records separated by "=\n"      
#      Sample Record        
#      WBGeneID \t PublicName \n        
#      Species \t Ortholog \t MethodsUsedToAssignOrtholog \n        
# BEGIN CONTENTS        
=       
WBGene00000001  aap-1   
Ascaris suum    GS_11030    WormBase-Compara
Brugia malayi   WBGene00227541  WormBase-Compara
Bursephelenchus xylophilus  BUX.s00055.227  WormBase-Compara
Caenorhabditis angaria  Cang_2012_03_13_00205.g6964.t3  WormBase-Compara
Caenorhabditis brenneri WBGene00194098  TreeFam; WormBase-Compara
Caenorhabditis briggsae WBGene00032086  Hillier-set; OrthoMCL; Inparanoid_7; OMA;     WormBase-Compara
Caenorhabditis japonica WBGene00207613  WormBase-Compara
Caenorhabditis remanei  WBGene00069407  Inparanoid_7; OMA; TreeFam; WormBase-Compara
Caenorhabditis sp.11    Csp11.Scaffold542.g3421.t1  WormBase-Compara
Caenorhabditis sp.5 Csp5_scaffold_00676.g14307.t1   WormBase-Compara
Danio rerio ENSEMBL:ENSDARP00000056212  TreeFam
Dirofilaria immitis nDi.2.2.2.t01810    WormBase-Compara
Drosophila melanogaster ENSEMBL:FBpp0303635 EnsEMBL-Compara; TreeFam
Haemonchus contortus    HCOI02027400.t1 WormBase-Compara
Heterorhabditis bacteriophora   Hba_15363   WormBase-Compara
Homo sapiens    ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam
Loa loa EFO26046.2  WormBase-Compara
Meloidogyne hapla   MhA1_Contig1573.frz3.gene15 WormBase-Compara
Mus musculus    ENSEMBL:ENSMUSP00000034296  EnsEMBL-Compara; TreeFam
Onchocerca volvulus WBGene00241206  WormBase-Compara
Panagrellus redivivus   Pan_g2405.t1    WormBase-Compara
Pristionchus pacificus  WBGene00117228  Inparanoid_7; OMA; WormBase-Compara
Trichinella spiralis    EFV56516    WormBase-Compara
=       
WBGene00000002  aat-1   
Ascaris suum    GS_20881    WormBase-Compara

Edit: All I really need from each record is the entry corresponding to "Homo sapiens". So, ideally, my data frame in R would be:

WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam 
WBGene00000002 aat-1 etc etc

Upvotes: 3

Views: 147

Answers (2)

Rich Scriven

Reputation: 99331

I would recommend using readLines to read the data into R. Since you gave us the file path in the comments, use file to open the connection to the file first, then readLines. And it's always good practice to close the connection after we've read and stored the data into R.

> con <- file("../Input/c_elegans.PRJNA13758.current.best_blastp_hits.txt", 
              open = "r")
> XX <- readLines(con)
> close(con)

> record <- grep("^WBGene", XX, value = TRUE)
> sapien <- grep("Homo sapiens", XX, value = TRUE, fixed = TRUE)
> gsub("\\s+", " ", paste0(record[1], sapien))
## [1] "WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam"

The entire record vector for your sample data is

> record
## [1] "WBGene00000001  aap-1   " "WBGene00000002  aat-1   "

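Both vectors are in file order, so the Homo sapiens line found for record 2 is pasted to record 2, the one for record 3 to record 3, and so on. The pairing is positional, so it assumes every record contains exactly one Homo sapiens line: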

paste0(record, sapien)
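
A quick way to see why the positional pairing matters: paste0 recycles the shorter vector without a warning, so a record that lacks a Homo sapiens line shifts every later pairing. The sketch below uses made-up placeholder values (GENE_C and the PLACEHOLDER ids are not from the file) purely to illustrate the check:

```r
# Placeholder vectors, not real file contents: three gene lines but only
# two Homo sapiens lines, as if the second record had no human ortholog.
record <- c("WBGene00000001\taap-1\t",
            "WBGene00000002\taat-1\t",
            "GENE_C\tname-c\t")
sapien <- c("Homo sapiens\tPLACEHOLDER_1\tTreeFam",
            "Homo sapiens\tPLACEHOLDER_3\tTreeFam")

# Check the one-human-line-per-record assumption before pasting
lengths_match <- length(record) == length(sapien)

# paste0 recycles the shorter vector silently, so the pairs drift apart
combined <- paste0(record, sapien)

print(lengths_match)
print(combined)
```

If lengths_match is FALSE on the real data, the pairs need to be built record by record instead.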

Worth noting that the OP's data frame was finally created with

do.call(rbind, strsplit(paste0(record, sapien), split = "\\s+"))
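
If some records have no Homo sapiens entry, or if you want "Homo sapiens" and the method list to stay in single columns, one alternative is to split the lines into records first and pick the human line per record. This is only a sketch: it assumes the fields are tab-separated, as the file header states, and parse_record, the column names, and the inline sample lines are illustrative, not part of the original answer.

```r
# Hypothetical in-memory sample standing in for XX <- readLines(con)
XX <- c(
  "# BEGIN CONTENTS",
  "=",
  "WBGene00000001\taap-1",
  "Ascaris suum\tGS_11030\tWormBase-Compara",
  "Homo sapiens\tENSEMBL:ENSP00000361075\tInparanoid_7; TreeFam",
  "=",
  "WBGene00000002\taat-1",
  "Ascaris suum\tGS_20881\tWormBase-Compara"
)

# Drop header comments, then number the records: each "=" line starts a new one
body    <- XX[!grepl("^#", XX)]
is_sep  <- grepl("^=\\s*$", body)
rec_id  <- cumsum(is_sep)
records <- split(body[!is_sep], rec_id[!is_sep])

# Hypothetical helper: turn one record (a character vector of lines)
# into a one-row data frame, with NA where no Homo sapiens line exists
parse_record <- function(lines) {
  gene <- strsplit(trimws(lines[1]), "\t", fixed = TRUE)[[1]]
  hit  <- grep("^Homo sapiens\t", lines, value = TRUE)
  fields <- if (length(hit) > 0) {
    strsplit(hit[1], "\t", fixed = TRUE)[[1]]
  } else {
    rep(NA_character_, 3)
  }
  data.frame(gene_id     = gene[1],
             public_name = gene[2],
             species     = fields[1],
             ortholog    = fields[2],
             methods     = fields[3],
             stringsAsFactors = FALSE)
}

df <- do.call(rbind, lapply(records, parse_record))
print(df)
```

Splitting on the tab rather than on "\\s+" keeps multi-word values such as the species name and "Inparanoid_7; TreeFam" in one column each, so every row has the same number of columns.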

Upvotes: 1

luis_js

Reputation: 611

This might work too, using "scan":

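dat <- matrix(unlist(scan(file       = "data",
                          what       = list(""),
                          sep        = "\n",
                          skip       = 8,      # skip the 8-line file header
                          multi.line = FALSE)),
              ncol  = 25,   # each record spans 25 lines: "=", the gene line, 23 species lines
              byrow = TRUE)
# column 2 is the gene line, column 18 is the Homo sapiens line
paste(dat[,2], dat[,18])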

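Each full line is treated as a field, so each row of dat is a record and each column is one line of that record. If needed, each line can then be split on '\t'. Note that this relies on every record having the same number of lines (25 in the sample).

Finally, I combine columns 2 and 18, the gene line and the Homo sapiens line.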
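
Since the question says records vary in length, it may be worth checking the fixed-width assumption before building the matrix. A minimal sketch, assuming the lines after the header are already in a character vector (the short sample below is made up for illustration):

```r
# Made-up sample: two records of different lengths, each starting with "="
lines <- c("=", "gene line 1", "species a", "species b",
           "=", "gene line 2", "species a")

# Positions of the "=" separator lines that start each record
starts <- grep("^=\\s*$", lines)

# Number of lines in each record, counting the separator itself
record_lengths <- diff(c(starts, length(lines) + 1))

# TRUE only if every record has the same number of lines
fixed_width <- length(unique(record_lengths)) == 1

print(record_lengths)
print(fixed_width)
```

If fixed_width comes back FALSE, the matrix layout with a fixed ncol will misalign records, and a per-record approach is the safer route.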

Upvotes: 0
