Extracting a paragraph from articles | Regular Expression

Question

I have scraped several articles concerning terrorist attacks. From these articles I would like to extract a specific paragraph.

This is a sample of the articles scraped:

By   DAVID D. KIRKPATRICK    MARCH 18, 2015 
Scenes from Tunisian state television showed confusion outside an art museum and Parliament on Wednesday after gunmen attacked.
CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a
midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry 
that is vital to  Tunisia  as it struggles to consolidate the only transition to democracy 
after the Arab Spring revolts. 
Tunisian officials had initially said that the attackers took 10
hostages and killed nine people, including seven foreign visitors and two Tunisians.

What I want to extract for further analysis is the text that runs, in this example, from "CAIRO —" to the first full stop.

This is the regular expression that I came up with:

([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s

With this regular expression I extract only the start of the paragraph, not the rest of it.

Fallenhero · Accepted Answer

Use a non-greedy (lazy) quantifier:

(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)

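The ? after a + (or *) makes the quantifier non-greedy. It then matches as little as possible, instead of the default behaviour of matching as much as possible. Here that means [\s\S]+? stops at the first full stop followed by whitespace rather than the last one.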
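The question does not say which language is being used, so here is a minimal sketch in Python, assuming its standard re module. The names ARTICLE, GREEDY, LAZY and extract_lead are made up for illustration, and the sample text is the snippet from the question.

```python
import re

# Sample article text taken from the question.
ARTICLE = (
    "By   DAVID D. KIRKPATRICK    MARCH 18, 2015 \n"
    "Scenes from Tunisian state television showed confusion outside an art "
    "museum and Parliament on Wednesday after gunmen attacked.\n"
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the "
    "tourist industry \n"
    "that is vital to  Tunisia  as it struggles to consolidate the only "
    "transition to democracy \n"
    "after the Arab Spring revolts. \n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and "
    "two Tunisians.\n"
)

# Pattern from the question: [\s\S]+ is greedy and runs to the LAST ". ".
GREEDY = re.compile(r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s")
# Pattern from the answer: [\s\S]+? is lazy and stops at the FIRST ". ".
LAZY = re.compile(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)")


def extract_lead(text, pattern):
    """Return the whole first match (group 0), stripped, or None if no match."""
    match = pattern.search(text)
    return match.group(0).strip() if match else None


greedy_lead = extract_lead(ARTICLE, GREEDY)
lazy_lead = extract_lead(ARTICLE, LAZY)
print(lazy_lead)
```

In this sketch lazy_lead ends at "Arab Spring revolts." while greedy_lead runs on to "two Tunisians.". If the original code used something like re.findall, the single capture group in the question's pattern would return only "CAIRO", which may be why only the start of the paragraph appeared. The extra outer parentheses in the answer's pattern make the full paragraph group 1. One caveat: a full stop followed by whitespace also ends the match at an abbreviation such as "D. ", so paragraphs containing initials would be cut short.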
