Reputation: 31
R experts,
I have a large text file with a specific pattern and format.
My text.txt contains:
x1 `xx`nkkna`yy`taktnaknvcaklrhkahnktn, altlkhakthakd`xx`nmm cataitha`yy`knkcnaktnhakt
x2 `xx`ngkna`yy`taktnaknvcaklrhkahnktn, altlkhakthakdnmm cataithaknkcnaktnhakt
x3 `xx`nkg,kna`yy`taktnaknvcaklrhkahnktn, altlkhakthakdnmm cataithaknk`xx`cna`yy`ktnhakt
x4 nkkndataktnaknvcaklrhkahnktn, altlkhakthakdnmm cataithaknkcnaktnhakt
Then I want R to find a list of words, in this case x1, x2, x3 and x4, and for each of them return a list of the strings that sit between "xx" and "yy".
As such, the result will be four lists:
x1 = c("nkkna", "nmm cataitha")
x2 = c("ngkna")
x3 = c("nkg,kna", "cna")
x4 = c("NA")
However, I am facing two problems and would like to ask for your help.
x <- read.csv(textConnection("xxx")) may help, but my file is too large to copy and paste, and the file should be read in as csv. Is there a better way to load my text file into R as an object that can be searched and grepped afterwards?
I have learned that strsplit may be used; it seems to work on material scraped with RCurl. Does it work here too? If yes, could you teach me how?
Thank you so much.....
Upvotes: 3
Views: 469
Reputation: 179398
To answer your first question: to read a text file, use the function scan(). The references you see on SO to textConnection are purely to read in example data that is pasted into the console. This is what I do next to read your data:
txt <- "
x1 `xx`nkkna`yy`taktnaknvcaklrhkahnktn, altlkhakthakd`xx`nmm cataitha`yy`knkcnaktnhakt
x2 `xx`ngkna`yy`taktnaknvcaklrhkahnktn, altlkhakthakdnmm cataithaknkcnaktnhakt
x3 `xx`nkg,kna`yy`taktnaknvcaklrhkahnktn, altlkhakthakdnmm cataithaknk`xx`cna`yy`ktnhakt
x4 nkkndataktnaknvcaklrhkahnktn, altlkhakthakdnmm cataithaknkcnaktnhakt"
dtxt <- textConnection(txt)
Then I use scan in the same way to read the textConnection data. In your own code, modify the following line so that dtxt is your file location. I keep it in this format so that other people can replicate my results without having to create a file on their own file system:
dat <- scan(dtxt, what="character", sep="\n")
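For a file on disk, the same call might look like the sketch below. The temporary file and the two shortened sample lines are assumptions made only so the snippet runs anywhere; in practice `path` would point at your own text.txt. readLines() is shown next to scan() as a base R alternative that also returns one string per line.

```r
# Hypothetical file: write two shortened sample lines to a temporary
# location so the sketch is reproducible without your real text.txt.
path <- tempfile(fileext = ".txt")
writeLines(c("x1 `xx`nkkna`yy`takt", "x4 nkkndatakt"), path)

# Same scan() call as above, with a file path instead of the connection.
dat_file <- scan(path, what = "character", sep = "\n", quiet = TRUE)

# readLines() also gives one element per line of the file.
dat_lines <- readLines(path)
```

Either object is an ordinary character vector, so grep, strsplit and gsub can be applied to it directly.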
Now that you have read the data, it is a (somewhat complicated) call to sapply, strsplit and gsub to manipulate the data.
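# For each line: split on `xx`, drop the piece before the first `xx`,
# then keep only the text up to the next backtick (the start of `yy`).
sapply(seq_along(dat),
       function(i) unlist(c(sapply(strsplit(dat[i], "`xx`"),
                                   function(x) gsub("^(.*?)`.*", "\\1", x)[-1]))))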
The results match what you specified, except that x4, which has no match, comes back as character(0) rather than "NA":
[[1]]
[1] "nkkna" "nmm cataitha"
[[2]]
[1] "ngkna"
[[3]]
[1] "nkg,kna" "cna"
[[4]]
character(0)
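If you would rather have the list named x1 to x4, with a missing value where nothing matches, one possible variation uses regmatches() and gregexpr() from base R. This is only a sketch: the sample lines are shortened versions of the question's data, the name `res` is made up for illustration, and it assumes the label is always the first space-separated word on a line. It returns a real NA rather than the string "NA".

```r
# Shortened sample lines (assumption: same pattern as the question's file).
dat <- c(
  "x1 `xx`nkkna`yy`takt, alt`xx`nmm cataitha`yy`knk",
  "x2 `xx`ngkna`yy`takt",
  "x3 `xx`nkg,kna`yy`takt`xx`cna`yy`ktn",
  "x4 nkkndatakt"
)

# Non-greedy match of everything from `xx` up to the next `yy`.
m <- regmatches(dat, gregexpr("`xx`.*?`yy`", dat, perl = TRUE))

# Strip the delimiters; use NA where a line has no match at all.
res <- lapply(m, function(x) {
  if (length(x) == 0) return(NA_character_)
  gsub("`xx`|`yy`", "", x)
})

# Name each element by the first word on its line (x1, x2, ...).
names(res) <- sub(" .*", "", dat)
```

After this, res$x1 or res[["x3"]] gives the matches for a single label.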
Upvotes: 9