Reputation: 881
I have many text files and need to extract the dates from them. Each date appears in the text in the following format:
random text here
261 words 18 February 2008 22:25 Dow Jones International News DJI English
/This is a sample statement of line present in text file from which the date has to be extracted/.
A single text file contains many such lines, always in this format.
One possible algorithm is to select the three words that come right after the word "words" in the line above. Those three words are the date. I need R code for this.
Note that the data in the text files is NOT arranged in rows and columns, and it is NOT a bulleted list either. It is plain paragraph text.
Upvotes: 0
Views: 209
Reputation: 269526
Use grep to pick out the lines that start with digits followed by "words" (allowing for leading spaces), remove everything up to and including "words", and convert the result to "Date" class. Note that as.Date ignores any junk after the date.
# test data
Lines <- "random text here
261 words 18 February 2008 22:25 Dow Jones International News DJI English
/This is a sample statement of line present in text file from which the
date has to be extracted/.
11 words 18 January 2009 20:20 Dow Jones International News DJI English
"
L <- readLines(textConnection(Lines))
pat <- "^ *\\d+ words "
words.lines <- grep(pat, L, value = TRUE)
as.Date(sub(pat, "", words.lines), format = "%d %B %Y")
giving:
[1] "2008-02-18" "2009-01-18"
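If the "words" marker can also fall in the middle of a paragraph rather than at the start of a line, a variation along the following lines might help. This is only a sketch: the function name extract_dates, the sample text, and the temporary file are made up for illustration, and it assumes full English month names and a four-digit year.

```r
# Sketch: pull every "<n> words <day> <month> <year>" date out of one text.
# extract_dates is a hypothetical helper, not part of base R.
extract_dates <- function(text) {
  # Collapse lines so a match is found wherever it falls in a paragraph.
  text <- paste(text, collapse = " ")
  pat <- "[0-9]+ words [0-9]{1,2} [A-Za-z]+ [0-9]{4}"
  hits <- regmatches(text, gregexpr(pat, text))[[1]]
  # Keep only the three date words that follow "words".
  sub("^[0-9]+ words ", "", hits)
}

# Hypothetical example: write sample text to a temporary file and read it back.
f <- tempfile(fileext = ".txt")
writeLines(c("random text here 261 words 18 February 2008 22:25 Dow Jones",
             "more text 11 words 18 January 2009 20:20 Dow Jones"), f)

date.words <- extract_dates(readLines(f))
# %B matches full month names in the current locale (English assumed here).
dates <- as.Date(date.words, format = "%d %B %Y")
print(dates)
```

To cover many files, the same helper could be applied with lapply over the paths returned by list.files(..., full.names = TRUE), reading each file with readLines first.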
Upvotes: 1