Reputation: 489
I need to extract information from text files whose structure varies from file to file. Whilst this can be done using a macro, the files are variable, so selecting by line number and by spacing within a line does not work for all of them.
I was wondering if anyone could tell me whether there is a way of parsing txt files, searching by keyword, and extracting the information that follows the keyword? For example, given something like Flow Rate: 99.99, I would want to extract the 99.99. Another issue is that, using the Flow Rate example, Flow Rate would appear numerous times in each file. Is there a way to alias/index Flow Rate: so that I can select, say, the third occurrence?
Any hints or tips would be welcome. I know how to print the entire line when a keyword is identified, but not how to deal with multiple occurrences or how to select only the number after the keyword:
all_data = readLines("Unit 5 2013.txt")
hours_of_operation <- grep("Annual Hours of Operation: ",all_data)
all_data[hours_of_operation]
[1] " Annual Hours of Operation: 8760.0 hours/yr"
Thanks
J
Upvotes: 5
Views: 9948
Reputation: 4921
The following may help. I assume that you have read your text into character vector(s).
Data example
Note: if the text contains capitals (e.g. "Flow Rate"), you may want to apply tolower(ex) first so that a lowercase keyword still matches.
ex<-c("The annual observed flow rate: 99.99")
Regexpr & Regmatches
Here regexpr searches for a number with one or two digits on each side of the decimal point. The period is escaped so that it matches a literal "." rather than any character.
res<-regmatches(ex, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ex))
Using position parameters
Another way to do it is to use the cwhmisc package. This solution searches for the start position of the word "rate". The number begins 6 characters later (after "rate", the colon and a space), so you can then extract it with substr.
library(cwhmisc)
A<-cpos(ex,"rate", start=1) #start position of "rate" in the string
res<-substr(ex, start=A+6, stop=A+10) #the five characters of "99.99"
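If you would rather not depend on an extra package, the same position-based idea can be sketched in base R. This is only an illustration: the variable names key, pos and start are my own, and it assumes the value is exactly five characters long, as in "99.99".

```r
ex <- c("The annual observed flow rate: 99.99")

# Keyword including the colon and the space that precede the number
key <- "rate: "

# regexpr returns the start position and carries the match length as an attribute
pos <- regexpr(key, ex, fixed = TRUE)

# First character after the keyword
start <- as.integer(pos) + attr(pos, "match.length")

# Assumption: the value is five characters wide, e.g. "99.99"
res <- substr(ex, start, start + 4)
```

Because the offset is derived from the keyword's own length, it does not need to be counted by hand.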
If flow rate appears multiple times
Split the elements of the vector into substrings and capture the numbers as before.
ex<-c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")
ndat<-unlist(strsplit(ex, "flow"))
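As an alternative to splitting first, here is a minimal sketch that pulls every number following "flow rate:" in one pass with gregexpr and then picks an occurrence by index. The pattern and the names pattern, hits and values are illustrative assumptions; it assumes each value is written as digits with an optional decimal part.

```r
ex <- c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")

# Keyword plus the number that follows it (assumed format: digits, optional decimals)
pattern <- "flow rate: *[0-9]+(\\.[0-9]+)?"

# gregexpr finds every match in the string, not just the first one
hits <- regmatches(ex, gregexpr(pattern, ex))[[1]]

# Strip the keyword so only the number remains, then convert to numeric
values <- as.numeric(sub("flow rate: *", "", hits))

# Select the third occurrence
third <- values[3]
```

Indexing values then answers the "third occurrence" part of the question directly.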
Upvotes: 1
Reputation: 2502
I am guessing that each line holds one data point that you want to parse. If so, you can read the data into a vector and use the grepl() function to find all elements of the vector that contain what you need.
So for example you have the data:
lhr: time to departure 5:00
dfw: time to arrival 4:40
jfk: time to arrival 5:50
dfw: time to departure 6:00
lax: time to departure 6:00
And you want to pull out the "dfw: " entries, then you do
data = readLines("file.txt")
data[grepl("dfw: ", data)]
And if you want the second entry of this, you do
data[grepl("dfw: ", data)][2]
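To get only the number after the keyword rather than the whole line, the grepl() filter can be combined with sub() and a capture group. The sketch below uses made-up lines in place of readLines("file.txt"); the "Flow Rate" lines, their units and the names keyword, matched and flow are hypothetical, chosen to mirror the question.

```r
# Hypothetical lines standing in for readLines("file.txt")
data <- c(" Annual Hours of Operation: 8760.0 hours/yr",
          " Flow Rate: 99.99 m3/h",
          " Flow Rate: 90.03 m3/h",
          " Flow Rate: 92.22 m3/h")

# Keep only the lines that contain the keyword
keyword <- "Flow Rate: "
matched <- data[grepl(keyword, data, fixed = TRUE)]

# Capture the number that follows the keyword and discard the rest of the line
flow <- as.numeric(sub(".*Flow Rate: *([0-9.]+).*", "\\1", matched))

# Third occurrence of the keyword in the file
third <- flow[3]
```

This assumes one occurrence of the keyword per line; lines with several occurrences would need gregexpr instead.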
Upvotes: 3