Parsing txt files and extracting information in R

Question

I need to extract information from text files whose structure varies from file to file. This can be done with a macro, but because the files vary, selecting by line number and by spacing within a line does not work for all files.

Is there a way of parsing txt files, searching by keyword, and extracting the information that follows the keyword? For example, given something like Flow Rate: 99.99, I would want to extract the 99.99. Another issue is that, using the Flow Rate example, Flow Rate appears numerous times in each file. Is there a way to alias/index Flow Rate: so that I can select, say, the third occurrence?

Any hints or tips would be welcome. I know how to print the entire line when a keyword is found, but not how to deal with multiple occurrences, or how to select only the number after the keyword:

all_data = readLines("Unit 5 2013.txt")
hours_of_operation <- grep("Annual Hours of Operation:    ",all_data)
all_data[hours_of_operation]
[1] "    Annual Hours of Operation:    8760.0 hours/yr"

Thanks

J

Ruthger Righart · Accepted Answer

The following may help. I assume that you have read your text into one or more character vectors.

Data example

Note: If "Flow Rate" is capitalised in your files, you may want to apply tolower(ex) first so that lowercase patterns match.

ex<-c("The annual observed flow rate: 99.99")

Regexpr & Regmatches

Here regexpr searches for a number with one or two digits before and after the period. The period is escaped so that it matches a literal "." rather than any character.

res<-regmatches(ex, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ex))
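
The pattern above matches any number of that shape, wherever it appears in the line. If the number has to be the one that follows a particular keyword, one option is to put the keyword into the pattern and capture only the number. The following is a minimal sketch in base R; the sample line, the keyword "flow rate:" and the number format are assumptions and would need adjusting to the real files.

```r
# Hypothetical sample line; keyword and number format are assumptions.
ex <- "The annual observed flow rate: 99.99"

# Match the keyword, optional whitespace, then capture the number in ( ).
pattern <- "flow rate:\\s*([0-9]+\\.?[0-9]*)"

# regexec() keeps the capture groups; ignore.case = TRUE also matches "Flow Rate:".
m <- regmatches(ex, regexec(pattern, ex, ignore.case = TRUE))

# m[[1]][1] is the full match, m[[1]][2] is the captured number.
flow_rate <- as.numeric(m[[1]][2])
```

Because the keyword is part of the pattern, other numbers on the same line are ignored.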

Using position parameters

Another way to do it is to use the package cwhmisc. This solution finds the start position of the word "rate". The number begins 6 positions later (after "rate: "), so you can then extract it with substr.

library(cwhmisc)
A<-cpos(ex,"rate", start=1) # start position of "rate" in the string
res<-substr(ex, start=A+6, stop=A+10) # the five characters of the number
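
If installing an extra package is not convenient, the same position-based idea can be sketched in base R. This is only an illustration: it assumes the keyword is followed by exactly ": " and that the number is five characters wide, which may not hold for every file.

```r
ex <- "The annual observed flow rate: 99.99"

# Keyword including the separator, so no offset has to be counted by hand.
key <- "rate: "

# regexpr() gives the start of the first match; fixed = TRUE treats key literally.
A <- as.integer(regexpr(key, ex, fixed = TRUE))

# The number starts right after the keyword.
start <- A + nchar(key)

# Assumes the number is exactly 5 characters wide (e.g. "99.99").
res <- substr(ex, start, start + 4)
```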

If flow rate appears multiple times

Split the elements of the vector into substrings and capture the numbers as before.

ex<-c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")
ndat<-unlist(strsplit(ex, "flow"))
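
To finish the idea, the numbers can be captured from the pieces and then selected by position, for example the third occurrence. The sketch below shows this in base R, along with an alternative that uses gregexpr to find all matches in one pass. The last part applies the same indexing to lines as returned by readLines; the all_data vector there is made-up sample data, not taken from the original files.

```r
ex <- c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")
num_pattern <- "[0-9]{1,2}\\.[0-9]{1,2}"

# Option 1: split on the keyword, then pull the number from each piece.
ndat <- unlist(strsplit(ex, "flow"))
# Pieces without a number (the text before the first "flow") are dropped.
nums <- regmatches(ndat, regexpr(num_pattern, ndat))

# Option 2: find every match in the original string in one pass.
all_nums <- regmatches(ex, gregexpr(num_pattern, ex))[[1]]

# Select the third occurrence.
third <- as.numeric(nums[3])

# Same idea for a file read line by line (hypothetical sample lines).
all_data <- c("Flow Rate: 1.50", "other text", "Flow Rate: 2.75", "Flow Rate: 99.99")
hits <- grep("Flow Rate:", all_data, fixed = TRUE)  # indices of all matching lines
third_line <- all_data[hits[3]]                     # third line containing the keyword
third_value <- as.numeric(sub(".*Flow Rate:\\s*([0-9.]+).*", "\\1", third_line))
```

Indexing the vector of matches (nums[3], hits[3]) is what stands in for an alias on the keyword: each occurrence gets a position, and any of them can be selected by number.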
