a83
a83

Reputation: 31

R: how to get information from a txt file with R

R experts,

I have a large text file, which has specific pattern and format.

My text.txt contains

x1 `xx`nkkna`yy`taktnaknvcaklrhkahnktn, altlkhakthakd`xx`nmm  cataitha`yy`knkcnaktnhakt

x2 `xx`ngkna`yy`taktnaknvcaklrhkahnktn, altlkhakthakdnmm  cataithaknkcnaktnhakt 

x3 `xx`nkg,kna`yy`taktnaknvcaklrhkahnktn, altlkhakthakdnmm  cataithaknk`xx`cna`yy`ktnhakt 

x4  nkkndataktnaknvcaklrhkahnktn, altlkhakthakdnmm  cataithaknkcnaktnhakt 

Then, I want to ask R to find a list of words, in this case is x1, x2, x3 and x4 And inbetween, I want to get a list for each of them, that is between "xx" and "yy".

As such, the results will be four lists

x1 = c("nkkna", "nmm  cataitha")
x2 = c("ngkna")
x3 = c("nkg,kna", "cna")
x4 = c("NA")

However, I am facing two problems would like to ask for your help.

x <- read.csv(textConnection"xxx") may help, but the problem is my file is too large to be copy and past, and the file should be be readin as csv. Are there any much better way to load my text file to R as an object that can be search and grep afterwards?

I learn strsplit maybe used, it seems to work in RCurl scrapped materials, does it work here too? If yes, could you mind to teach me how?

Thank you so much.....

Upvotes: 3

Views: 469

Answers (1)

Andrie
Andrie

Reputation: 179398

To answer your first question, to read a text file you should use the function scan(). The references you see on SO to textConnection are purely to read in some example data that is pasted into the console. This is what I am doing next to read your data:

txt <- "
x1 `xx`nkkna`yy`taktnaknvcaklrhkahnktn, altlkhakthakd`xx`nmm  cataitha`yy`knkcnaktnhakt
x2 `xx`ngkna`yy`taktnaknvcaklrhkahnktn, altlkhakthakdnmm  cataithaknkcnaktnhakt 
x3 `xx`nkg,kna`yy`taktnaknvcaklrhkahnktn, altlkhakthakdnmm  cataithaknk`xx`cna`yy`ktnhakt 
x4  nkkndataktnaknvcaklrhkahnktn, altlkhakthakdnmm  cataithaknkcnaktnhakt"

dtxt <- textConnection(txt)

Then I use scan in the same way to read the textConnetion data. In your own code, you should modify the following line, so tat dtxt is your file location. I keep it in this format, so that other people can replicate my results without having to create a file on their own file system:

dat <- scan(dtxt, what="character", sep="\n")

Now that you have read the data, it is a (somewhat complicated) call to sapply, strsplit and gsub to manipulate the data.

sapply(seq_along(dat), 
    function(i)unlist(c(sapply(strsplit(dat[i], "`xx`"), 
              function(x)gsub("^(.*?)`.*", "\\1", x)[-1]))))

The results are exactly as you specified:

[[1]]
[1] "nkkna"         "nmm  cataitha"

[[2]]
[1] "ngkna"

[[3]]
[1] "nkg,kna" "cna"    

[[4]]
character(0)

Upvotes: 9

Related Questions