geotheory

Reputation: 23650

Extract first sentence in string

I want to extract the first sentence from the following string with a regex. The rule I want to implement (which I know won't be a universal solution) is to extract from the start of the string (^) up to and including the first period, exclamation mark, or question mark that is preceded by a lowercase letter or a digit.

require(stringr)

x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

My best guess so far has been a non-greedy string-before-match approach, but it fails in this case:

str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA

Any tips much appreciated.

Upvotes: 6

Views: 2444

Answers (2)

dmi3kno

Reputation: 3045

The corpus package has special handling for abbreviations when determining sentence boundaries:

library(corpus)
text_split(x, "sentences")
#>   parent index text
#> 1 1          1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1          2 The death toll has now risen to at least 187.

There's also a useful dataset of common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used to disambiguate sentence boundaries.

Upvotes: 3

Wiktor Stribiżew

Reputation: 626747

You put the [a-z0-9][.?!] part into a non-consuming lookahead, so it is checked but never included in the match. You need to make it consuming if you plan to use str_extract:

> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."

See this regex demo.

Details

  • .*? - any 0+ chars other than line break chars, as few as possible
  • [a-z0-9] - an ASCII lowercase letter or a digit
  • [.?!] - a ., ? or !
  • (?= ) - a positive lookahead that requires a literal space immediately after the current position, without adding it to the match.
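
For readers who would rather not depend on stringr, the same consuming pattern can be sketched in base R with regexpr() and regmatches(). This is only a sketch under stated assumptions: first_sentence is a hypothetical helper name, not part of any package, and perl = TRUE is needed because R's default regex engine does not support lookaheads.

```r
# Hypothetical helper (not from the original answer): base R version of
# the str_extract() call, returning NA where the pattern does not match.
first_sentence <- function(text) {
  pattern <- ".*?[a-z0-9][.?!](?= )"
  m <- regexpr(pattern, text, perl = TRUE)   # -1 marks "no match"
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(text))
  out[hit] <- regmatches(text, m)            # only matched elements are returned
  out
}

x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."
first_sentence(x)
```

Because the lookahead demands a space after the terminator, a string that holds a single sentence with nothing after it yields NA with this pattern.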

Alternatively, you may use sub:

sub("([a-z0-9][?!.])\\s.*", "\\1", x)

See this regex demo.

Details

  • ([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
  • \s - a whitespace character
  • .* - any 0+ chars, as many as possible (up to the end of string).
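
One behavioural difference worth keeping in mind: sub() returns its input unchanged when the pattern does not match, whereas str_extract() returns NA. A small sketch of this, where first_sentence_sub is a hypothetical wrapper name used only for illustration:

```r
# Hypothetical wrapper around the sub() call above.
first_sentence_sub <- function(text) {
  # Keep Group 1 (the sentence-final character pair) and drop the
  # whitespace plus everything after it.
  sub("([a-z0-9][?!.])\\s.*", "\\1", text)
}

inputs <- c("It works! Really.", "Only one sentence.")
result <- first_sentence_sub(inputs)
print(result)
# The first element is trimmed to "It works!"; the second has no
# terminator followed by whitespace, so it comes back unchanged.
```

If you need to tell "no sentence boundary found" apart from "the whole string is one sentence", the extraction approach may be easier to work with than the replacement approach.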

Upvotes: 6
