southwind

Reputation: 636

Extract dates from sentences that contain certain keywords

I have a large paragraph of text, for example:

mytext <- "Date joined: some long text here 01/02/2012. Some more long text here, then commencement date: 1/5/2012. More info at the end."

I would like to extract all the dates found in any sentence that contains either the phrase "Date joined" or the phrase "commencement date".

So my result would be 1/5/2012 and 01/02/2012.

I tried a few regex patterns but haven't been successful so far.

Upvotes: 0

Views: 61

Answers (1)

Christopher Brown

Reputation: 41

Here is the original text you provided:

mytext <- "Date joined: some long text here 01/02/2012. Some more long text here, then commencement date: 1/5/2012. More info at the end."

First, split the text at the periods, which returns a vector of sentences.

sentences <- strsplit(mytext,".",fixed=TRUE)[[1]]

Then we select only those sentences with the phrases you noted.

relevant <- sentences[grepl("Date joined|commencement date",sentences)]

Now we can search for the dates:

unlist(regmatches(relevant,gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}",relevant)))

This produces the vector:

[1] "01/02/2012" "1/5/2012"

Depending on how widely the date format varies, you may have to adjust this regular expression pattern. Also, I used exactly the phrases you provided (including capitalization) to select the sentences. If the match should not be case-sensitive, use the ignore.case=TRUE option of grepl when selecting the sentences.
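As a minimal sketch of the case-insensitive variant: the input string below is a hypothetical variation of the question's text with different capitalization, and the name dates is chosen only for illustration. The rest follows the same steps as above.

```r
# Hypothetical input: the question's phrases, but with different capitalization.
mytext <- "DATE JOINED: some long text here 01/02/2012. Some more long text here, then Commencement Date: 1/5/2012. More info at the end."

# Split into sentences at literal periods.
sentences <- strsplit(mytext, ".", fixed = TRUE)[[1]]

# ignore.case = TRUE lets the phrases match regardless of capitalization.
relevant <- sentences[grepl("date joined|commencement date", sentences, ignore.case = TRUE)]

# Same date pattern as before: d/m/y with 1-2, 1-2 and 2-4 digits.
dates <- unlist(regmatches(relevant, gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}", relevant)))
print(dates)
```

Without ignore.case = TRUE, neither sentence in this input would be selected, because grepl matches case-sensitively by default.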

UPDATE: The original poster asked how to extract the first date found in each relevant sentence. I modified the code to provide a relevant example for that situation and to use sapply.

mytext <- "Date joined: some long text here 01/02/2012 and also here 05/13/1899. Some more long text here, then commencement date: 1/5/2012 and also 2/3/4567. More info at the end."
sentences <- strsplit(mytext,".",fixed=TRUE)[[1]]
relevant <- sentences[grepl("Date joined|commencement date",sentences)]
the_dates <- regmatches(relevant,gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}",relevant))
the_first_dates <- sapply(the_dates,function(z) z[1])

the_first_dates now contains:

[1] "01/02/2012" "1/5/2012"
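One case worth checking is a relevant sentence that contains no date at all. The sketch below uses a hypothetical input in which the "Date joined" sentence has no date, and swaps sapply for vapply (a substitution, not part of the original code) so that the result is always a character vector. Indexing an empty match with z[1] yields NA, so such a sentence shows up as a missing value instead of being silently dropped.

```r
# Hypothetical input: the first relevant sentence has no date in it.
mytext <- "Date joined: no date given here. Some more long text here, then commencement date: 1/5/2012 and also 2/3/4567. More info at the end."

sentences <- strsplit(mytext, ".", fixed = TRUE)[[1]]
relevant <- sentences[grepl("Date joined|commencement date", sentences)]

# One list element per relevant sentence; character(0) where nothing matched.
the_dates <- regmatches(relevant, gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}", relevant))

# z[1] is NA for an empty match; vapply enforces one string per sentence.
the_first_dates <- vapply(the_dates, function(z) z[1], character(1))
print(the_first_dates)
```

The result has one entry per relevant sentence, which keeps it aligned with the relevant vector.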

Upvotes: 1
