Reputation: 636
I have a large paragraph of text, for example:
mytext <- "Date joined: some long text here 01/02/2012. Some more long text here, then commencement date: 1/5/2012. More info at the end."
I would like to extract all the dates found in any sentence that contains either the phrase "Date joined" or "commencement date".
So my result would be 1/5/2012 and 01/02/2012.
I tried a few regex patterns but haven't been successful so far.
Upvotes: 0
Views: 61
Reputation: 41
Here is the original text you provided:
mytext <- "Date joined: some long text here 01/02/2012. Some more long text here, then commencement date: 1/5/2012. More info at the end."
First split the text at the periods and return a vector of sentences.
sentences <- strsplit(mytext,".",fixed=TRUE)[[1]]
Then we select only those sentences with the phrases you noted.
relevant <- sentences[grepl("Date joined|commencement date",sentences)]
Now we can search for the dates:
unlist(regmatches(relevant,gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}",relevant)))
This produces the vector:
[1] "01/02/2012" "1/5/2012"
Depending on how widely the date format varies, you may have to adjust this regular expression. Also, I used exactly the phrases (with capitalization) you provided to select the sentences. If the phrases should match regardless of case, pass ignore.case=TRUE to grepl when selecting your sentences.
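A minimal sketch of the case-insensitive variant. The sample string mytext2 is made up for illustration (it is the original text with different capitalization), and the names sentences2, relevant2 and dates2 are just placeholders:

```r
# Hypothetical sample text whose capitalization differs from the search phrases
mytext2 <- "DATE JOINED: some long text here 01/02/2012. Some more text, then Commencement Date: 1/5/2012. More info at the end."

# Split into sentences on literal periods
sentences2 <- strsplit(mytext2, ".", fixed = TRUE)[[1]]

# ignore.case = TRUE lets grepl match the phrases regardless of capitalization
relevant2 <- sentences2[grepl("date joined|commencement date", sentences2, ignore.case = TRUE)]

# Same date pattern as before
dates2 <- unlist(regmatches(relevant2, gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}", relevant2)))
```

Without ignore.case = TRUE, neither sentence in this sample would be selected, because grepl is case-sensitive by default.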
UPDATE: The original poster asked how to extract the first date found in each relevant sentence. I modified the code to provide a relevant example for that situation and to use sapply.
mytext <- "Date joined: some long text here 01/02/2012 and also here 05/13/1899. Some more long text here, then commencement date: 1/5/2012 and also 2/3/4567. More info at the end."
sentences <- strsplit(mytext,".",fixed=TRUE)[[1]]
relevant <- sentences[grepl("Date joined|commencement date",sentences)]
the_dates <- regmatches(relevant,gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}",relevant))
the_first_dates <- sapply(the_dates,function(z) z[1])
the_first_dates now contains:
[1] "01/02/2012" "1/5/2012"
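If this needs to run on many texts, the steps can be wrapped in a small helper. This is only a sketch: the function name first_dates and its phrases argument are my own invention, not anything built into R. It uses vapply instead of sapply so the result is always a character vector, and a relevant sentence that contains no date yields NA.

```r
# Hypothetical helper: first date in each sentence that mentions one of the phrases
first_dates <- function(text, phrases = "Date joined|commencement date") {
  # Split into sentences on literal periods
  sentences <- strsplit(text, ".", fixed = TRUE)[[1]]
  # Keep only the sentences containing one of the phrases
  relevant <- sentences[grepl(phrases, sentences)]
  # All date-like matches, one list element per relevant sentence
  found <- regmatches(relevant, gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}", relevant))
  # First match per sentence; z[1] is NA when a sentence has no match
  vapply(found, function(z) z[1], character(1))
}

example <- first_dates("Date joined: no date here. Then commencement date: 1/5/2012 and also 2/3/2013.")
```

Here the first sentence is selected but has no date, so example holds NA followed by "1/5/2012".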
Upvotes: 1