Reputation: 253
I have scraped several articles concerning terrorist attacks. From these articles I would like to extract a specific paragraph.
This is a sample of the articles scraped:
By DAVID D. KIRKPATRICK MARCH 18, 2015
Scenes from Tunisian state television showed confusion outside an art museum and Parliament on Wednesday after gunmen attacked.
CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a
midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry
that is vital to Tunisia as it struggles to consolidate the only transition to democracy
after the Arab Spring revolts.
Tunisian officials had initially said that the attackers took 10
hostages and killed nine people, including seven foreign visitors and two Tunisians.
What I want to extract for further analysis is the text that runs, in this example, from "CAIRO —" to the first full stop.
This is the regular expression that I came up with:
([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s
With this regular expression I extract only the starting point of the paragraph but I don't extract the rest of it.
Upvotes: 2
Views: 2907
Reputation: 6398
EDIT1:
Try the following regex:
([A-Z]+\w+\s*—\s*.*?\.)
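Note that . does not match a newline by default in most regex flavors, so this pattern needs a dot-all (single-line) flag when the paragraph spans several lines.
Your original regex already matches the text; the problem is the grouping. Try the following regex, which surrounds the whole pattern with parentheses: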
(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)
Group 1 contains the required string/text.
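To illustrate the grouping point, here is a minimal Python sketch. Python is an assumption here, since the question does not name a language, and the one-sentence sample string is shortened for illustration. It shows that the original pattern captures only the dateline in group 1, while the wrapped pattern puts the whole match in group 1.

```python
import re

# Shortened one-sentence sample, used only for illustration.
sample = "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday.\n"

# Pattern from the question: the only capturing group is the dateline.
original = re.search(r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s", sample)

# Same pattern wrapped in an outer pair of parentheses.
wrapped = re.search(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)", sample)

print(repr(original.group(1)))  # the dateline only
print(repr(wrapped.group(1)))   # the whole matched text
print(repr(wrapped.group(2)))   # the dateline, now in group 2
```

In Python the full match is also available as group(0) without adding the outer parentheses.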
Upvotes: 0
Reputation: 1583
Use a non-greedy quantifier:
(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)
The ? after a + (or *) makes it non-greedy, meaning it matches as little as possible instead of the default behaviour, where it matches as much as possible.
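The difference can be checked against the sample article from the question. This is a small sketch assuming Python's re module (the question does not state a language), with the article text pasted in as a string.

```python
import re

# Sample article from the question, line breaks preserved.
article = (
    "By DAVID D. KIRKPATRICK MARCH 18, 2015\n"
    "Scenes from Tunisian state television showed confusion outside an art "
    "museum and Parliament on Wednesday after gunmen attacked.\n"
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry\n"
    "that is vital to Tunisia as it struggles to consolidate the only transition to democracy\n"
    "after the Arab Spring revolts.\n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and two Tunisians.\n"
)

# Greedy: [\s\S]+ runs on to the LAST full stop followed by whitespace.
greedy = re.search(r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s", article)

# Non-greedy: [\s\S]+? stops at the FIRST full stop followed by whitespace.
lazy = re.search(r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s", article)

first_paragraph = lazy.group(0).strip()
print(first_paragraph)
```

Here group(0) is the whole match, so the outer parentheses are optional in Python. The greedy version starts at the same place but continues through the second paragraph.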
Upvotes: 2