squishy

Reputation: 489

Parsing txt files and extracting information in R

I need to extract information from text files whose structure varies from file to file. This can be done with a macro, but because the files vary, selecting by line number and by spacing within a line does not work for all of them.

Is there a way of parsing txt files, searching by keyword, and extracting the information that follows the keyword? For example, given something like Flow Rate: 99.99, I would want to extract the 99.99. Another issue is that, using the Flow Rate example, Flow Rate appears numerous times in each file. Is there a way to alias/index Flow Rate: so that I can select, say, the third occurrence?

Any hints or tips would be welcome. I know how to print the entire line when a keyword is found, but not how to deal with multiple occurrences or how to select only the number after the keyword:

all_data = readLines("Unit 5 2013.txt")
hours_of_operation <- grep("Annual Hours of Operation:    ",all_data)
all_data[hours_of_operation]
[1] "    Annual Hours of Operation:    8760.0 hours/yr"

Thanks

J

Upvotes: 5

Views: 9948

Answers (2)

Ruthger Righart

Reputation: 4921

The following may help. I assume that you have read your text into character vector(s).

Data example

Note: If "Flow Rate" is capitalised in your files, you may want to apply tolower(ex) first.

ex<-c("The annual observed flow rate: 99.99")

Regexpr & Regmatches

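Here regexpr searches for the first number with one or two digits before and after the decimal point. The period is escaped so that it matches a literal "." rather than any character.

res<-regmatches(ex, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ex))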
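The pattern above picks up the first number on the line, whatever precedes it. If the number has to be the one that follows the keyword, a capture group is one option. The following is a minimal sketch in base R; the sample string and the variable names are assumptions made for illustration.

```r
# Hypothetical sample line; the text and units are assumptions for illustration
ex <- "The annual observed Flow Rate: 99.99 m3/h"

# Match the keyword, optional whitespace, then capture the number in (...)
pattern <- "flow rate:[[:space:]]*([0-9]+\\.?[0-9]*)"

# regexec returns the full match plus each captured group;
# ignore.case = TRUE avoids the need for tolower()
m <- regmatches(ex, regexec(pattern, ex, ignore.case = TRUE))

# m[[1]][1] is the full match, m[[1]][2] is the captured number
flow_rate <- as.numeric(m[[1]][2])
```

Anchoring on the keyword means other numbers on the same line are ignored.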

Using position parameters

Another way to do it is to use the cwhmisc library. This solution finds the start position of the word "rate". The number you need begins 6 positions later (after "rate: "), so you can extract it with substr.

library(cwhmisc)
A<-cpos(ex,"rate", start=1) #position of "rate" in the string
res<-substr(ex, start=A+6, stop=A+10) #the five characters "99.99"

If flow rate appears multiple times

Split the elements of the vector into substrings and capture the numbers as before.

ex<-c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")
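ndat<-unlist(strsplit(ex, "flow"))
res<-regmatches(ndat, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ndat)) #pieces without a number are dropped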
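As an alternative to splitting the string, gregexpr returns every match rather than only the first, which makes it straightforward to pick, say, the third occurrence. This is a sketch in base R under the assumption that every value has the form digits, period, digits; the variable names are illustrative.

```r
ex <- "The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22"

# gregexpr finds every occurrence of keyword + number in the string
hits <- regmatches(ex, gregexpr("flow rate:[[:space:]]*[0-9]+\\.[0-9]+", ex))[[1]]

# Strip the keyword so that only the numbers remain
rates <- as.numeric(sub("flow rate:[[:space:]]*", "", hits))

# Select the third occurrence by position
third_rate <- rates[3]
```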

Upvotes: 1

Allen Wang

Reputation: 2502

I am guessing that you have one data point on each line that you want to parse. If so, you can read the file into a character vector and use the grepl() function to find all elements of the vector that contain what you need.

So for example you have the data:

lhr: time to departure 5:00
dfw: time to arrival 4:40
jfk: time to arrival 5:50
dfw: time to departure 6:00
lax: time to departure 6:00

If you want to pull out the "dfw: " entries, you do

data = readLines("file.txt")
data[grepl("dfw: ", data)]

And if you want the second of these entries, you do

data[grepl("dfw: ", data)][2]
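The same idea can be carried over to the Flow Rate case from the question: filter the lines by keyword, reduce each one to its number, then index the result. The sketch below uses base R only; the file contents, units and variable names are assumptions standing in for the output of readLines().

```r
# Hypothetical file contents standing in for readLines("file.txt")
all_data <- c(
  "    Flow Rate:    99.99 m3/h",
  "    Annual Hours of Operation:    8760.0 hours/yr",
  "    Flow Rate:    90.03 m3/h",
  "    Flow Rate:    92.22 m3/h"
)

# Keep only the lines containing the keyword (fixed = TRUE: literal match)
flow_lines <- all_data[grepl("Flow Rate:", all_data, fixed = TRUE)]

# Drop everything up to and including the keyword and the spaces after it
values <- sub(".*Flow Rate:[[:space:]]*", "", flow_lines)

# Drop everything from the first character that is not a digit or a period
values <- as.numeric(sub("[^0-9.].*$", "", values))

# Third occurrence of Flow Rate in the file
third_flow_rate <- values[3]
```

Because the selection is by keyword and occurrence rather than by line number, it should tolerate files in which the lines appear at different positions.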

Upvotes: 3
