Reputation: 79
I'm looking to read a txt file in record format in R as a dataframe with each row corresponding to a single record. Records are of varying lengths. Any idea how I do this?
This is the file header, followed by the first record and the start of the second:
# C. elegans orthologs
# WormBase version: WS241
# Generated:
# File is in record format with records separated by "=\n"
# Sample Record
# WBGeneID \t PublicName \n
# Species \t Ortholog \t MethodsUsedToAssignOrtholog \n
# BEGIN CONTENTS
=
WBGene00000001 aap-1
Ascaris suum GS_11030 WormBase-Compara
Brugia malayi WBGene00227541 WormBase-Compara
Bursephelenchus xylophilus BUX.s00055.227 WormBase-Compara
Caenorhabditis angaria Cang_2012_03_13_00205.g6964.t3 WormBase-Compara
Caenorhabditis brenneri WBGene00194098 TreeFam; WormBase-Compara
Caenorhabditis briggsae WBGene00032086 Hillier-set; OrthoMCL; Inparanoid_7; OMA; WormBase-Compara
Caenorhabditis japonica WBGene00207613 WormBase-Compara
Caenorhabditis remanei WBGene00069407 Inparanoid_7; OMA; TreeFam; WormBase-Compara
Caenorhabditis sp.11 Csp11.Scaffold542.g3421.t1 WormBase-Compara
Caenorhabditis sp.5 Csp5_scaffold_00676.g14307.t1 WormBase-Compara
Danio rerio ENSEMBL:ENSDARP00000056212 TreeFam
Dirofilaria immitis nDi.2.2.2.t01810 WormBase-Compara
Drosophila melanogaster ENSEMBL:FBpp0303635 EnsEMBL-Compara; TreeFam
Haemonchus contortus HCOI02027400.t1 WormBase-Compara
Heterorhabditis bacteriophora Hba_15363 WormBase-Compara
Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam
Loa loa EFO26046.2 WormBase-Compara
Meloidogyne hapla MhA1_Contig1573.frz3.gene15 WormBase-Compara
Mus musculus ENSEMBL:ENSMUSP00000034296 EnsEMBL-Compara; TreeFam
Onchocerca volvulus WBGene00241206 WormBase-Compara
Panagrellus redivivus Pan_g2405.t1 WormBase-Compara
Pristionchus pacificus WBGene00117228 Inparanoid_7; OMA; WormBase-Compara
Trichinella spiralis EFV56516 WormBase-Compara
=
WBGene00000002 aat-1
Ascaris suum GS_20881 WormBase-Compara
Edit: All I really need from each record is the entry corresponding to "Homo sapiens". So, ideally, my data frame in R would be:
WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam
WBGene00000002 aat-1 etc etc
Upvotes: 3
Views: 147
Reputation: 99331
I would recommend using readLines to read the data into R. Since you gave us the file path in the comments, use file to open the connection to the file first, then readLines. And it's always good practice to close the connection after we've read and stored the data in R.
> con <- file("../Input/c_elegans.PRJNA13758.current.best_blastp_hits.txt", open = "r")
> XX <- readLines(con)
> close(con)
> record <- grep("^WBGene", XX, value = TRUE)
> sapien <- grep("Homo sapiens", XX, value = TRUE, fixed = TRUE)
> gsub("\\s+", " ", paste0(record[1], sapien))
## [1] "WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam"
The entire record vector for your sample data is
> record
## [1] "WBGene00000001 aap-1 " "WBGene00000002 aat-1 "
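So the Homo sapiens line found for record 2 is pasted to record 2, the one for record 3 to record 3, and so on, with
paste0(record, sapien)
This pairs the two vectors by position, so it assumes every record contains exactly one Homo sapiens line.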
Worth noting that the OP's data frame was finally created with
do.call(rbind, strsplit(paste0(record, sapien), split = "\\s+"))
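If some records have no Homo sapiens line, or more than one, pairing record and sapien by position would misalign the rows. Below is a minimal sketch of a record-by-record alternative. It assumes fields are tab-separated, as the file header suggests. The sample vector XX and the helper extract_human are hypothetical: XX stands in for the output of readLines, and the helper is just one way to pull the human entry out of a record.

```r
# Hypothetical sample standing in for readLines(con); fields assumed tab-separated.
XX <- c(
  "# BEGIN CONTENTS",
  "=",
  "WBGene00000001\taap-1",
  "Ascaris suum\tGS_11030\tWormBase-Compara",
  "Homo sapiens\tENSEMBL:ENSP00000361075\tInparanoid_7; TreeFam",
  "=",
  "WBGene00000002\taat-1",
  "Ascaris suum\tGS_20881\tWormBase-Compara"
)

# Drop the "#" header lines, then number the records by counting "=" separators.
lines  <- XX[!grepl("^#", XX)]
rec_id <- cumsum(lines == "=")
keep   <- lines != "="
records <- split(lines[keep], rec_id[keep])

# Hypothetical helper: one row per human ortholog, or a single NA row if none.
extract_human <- function(rec) {
  gene <- trimws(strsplit(rec[1], "\t", fixed = TRUE)[[1]])
  hits <- strsplit(grep("^Homo sapiens\t", rec, value = TRUE), "\t", fixed = TRUE)
  if (length(hits) == 0) hits <- list(c("Homo sapiens", NA, NA))
  data.frame(
    WBGeneID   = gene[1],
    PublicName = gene[2],
    Ortholog   = vapply(hits, `[`, character(1), 2),
    Methods    = vapply(hits, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

df <- do.call(rbind, lapply(records, extract_human))
rownames(df) <- NULL
print(df)
```

With this sample, df has one row per gene, and the second row carries NA in Ortholog and Methods because that record lists no human ortholog.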
Upvotes: 1
Reputation: 611
This might work too, using "scan":
dat <- matrix(unlist(scan(file = "data",
                          what = list(""),
                          sep = "\n",
                          skip = 8,            # file header
                          multi.line = FALSE)),
              ncol = 25,                       # one record spans 25 lines
              byrow = TRUE)
paste(dat[,2], dat[,18])
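Each full line is treated as a field. Each row of dat is a record, and each column is one line of that record (if needed, each line can be split further on '\t'). This relies on every record spanning exactly 25 lines, as the sample record does; records of other lengths would shift the columns.
Finally, I combine columns 2 and 18, the ones of interest.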
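Splitting the selected lines on '\t' could look roughly like the sketch below. The vectors gene_lines and human_lines are hypothetical stand-ins for dat[, 2] and dat[, 18], the helper split_tab and the column names are made up for illustration, and the tab separator is an assumption based on the file header.

```r
# Hypothetical stand-ins for dat[, 2] and dat[, 18]; fields assumed tab-separated.
gene_lines  <- c("WBGene00000001\taap-1",
                 "WBGene00000002\taat-1")
human_lines <- c("Homo sapiens\tENSEMBL:ENSP00000361075\tInparanoid_7; TreeFam",
                 "Homo sapiens\tENSEMBL:EXAMPLE\tTreeFam")  # made-up second entry

# Hypothetical helper: split each line on tabs and keep exactly n fields,
# padding short lines with NA so every row has the same width.
split_tab <- function(x, n) {
  parts <- strsplit(x, "\t", fixed = TRUE)
  t(vapply(parts, function(p) trimws(p[seq_len(n)]), character(n)))
}

df <- data.frame(split_tab(gene_lines, 2),
                 split_tab(human_lines, 3),
                 stringsAsFactors = FALSE)
names(df) <- c("WBGeneID", "PublicName", "Species", "Ortholog", "Methods")
print(df)
```

Keeping the fields in separate columns avoids splitting on general whitespace, which would break "Homo sapiens" and multi-method entries such as "Inparanoid_7; TreeFam" into several pieces.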
Upvotes: 0