M.Huntz

Reputation: 253

Extracting a paragraph from articles | Regular Expression

I have scraped several articles concerning terrorist attacks. From these articles I would like to extract a specific paragraph.

This is a sample of the articles scraped:

By   DAVID D. KIRKPATRICK    MARCH 18, 2015 
Scenes from Tunisian state television showed confusion outside an art museum and Parliament on Wednesday after gunmen attacked.
CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a
midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry 
that is vital to  Tunisia  as it struggles to consolidate the only transition to democracy 
after the Arab Spring revolts. 
Tunisian officials had initially said that the attackers took 10
hostages and killed nine people, including seven foreign visitors and two Tunisians.

What I want to extract for further analysis is the text that runs, in this example, from "CAIRO —" to the first full stop.

This is the regular expression that I came up with:

([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s

With this regular expression I only extract the starting point of the paragraph, not the rest of it.

Upvotes: 2

Views: 2907

Answers (2)

Naveen Kumar R B

Reputation: 6398

EDIT1:

Try the following regex:

([A-Z]+\w+\s*—\s*.*?\.)
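
A minimal sketch of how this pattern behaves, assuming Python (the question does not name a language) and a shortened, illustrative version of the sample text. Since `.` does not match a newline by default, the pattern only spans a line-wrapped paragraph when `re.DOTALL` is set:

```python
import re

# Shortened, line-wrapped sample adapted from the question (illustrative only).
text = (
    "Scenes from state television showed confusion.\n"
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow.\n"
    "Tunisian officials had initially said otherwise.\n"
)

pattern = r"([A-Z]+\w+\s*—\s*.*?\.)"

# Without DOTALL, "." stops at the line break before any full stop is reached.
no_flag = re.search(pattern, text)

# With DOTALL, "." also matches "\n"; the lazy ".*?" stops at the first full stop.
with_flag = re.search(pattern, text, flags=re.DOTALL)

print(no_flag)
print(with_flag.group(1))
```

If the scraped paragraphs are not wrapped across lines, the flag should make no difference.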

Your regex does match the text that you want; the issue is grouping.

Try the following regex (the whole pattern is wrapped in parentheses):

(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)

Group 1 contains the required string/text.
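
To illustrate the grouping point, here is a small sketch, assuming Python and a shortened sample with a single paragraph after the dateline (both are assumptions, not from the question). In the original pattern, group 1 is only the dateline and the full text is in group 0; after wrapping, group 1 holds the full match:

```python
import re

# Shortened sample (illustrative only): one paragraph after the dateline.
text = (
    "Scenes from state television showed confusion.\n"
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis.\n"
)

original = r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s"
wrapped = "(" + original + ")"  # same pattern, surrounded by one extra group

m = re.search(original, text)
w = re.search(wrapped, text)

print(repr(m.group(1)))  # the dateline only
print(repr(m.group(0)))  # the whole match
print(repr(w.group(1)))  # the whole match, now available as group 1
```

In other words, the original pattern already matched the paragraph; reading `group(0)` instead of `group(1)` would have worked as well.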


Upvotes: 0

Fallenhero

Reputation: 1583

Use a non-greedy quantifier:

(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)

The ? after a + (or *) makes the quantifier non-greedy (lazy): it matches as little as possible, instead of the default behaviour of matching as much as possible.
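
A short comparison of the two behaviours, sketched in Python (an assumption, since the question does not name a language) on a shortened, illustrative sample with two sentences after the dateline:

```python
import re

# Shortened sample (illustrative only): two paragraphs, each ending in a full stop.
text = (
    "CAIRO — Gunmen killed 19 people in a\n"
    "midday attack on a museum in downtown Tunis. \n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people.\n"
)

greedy = re.compile(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)")
lazy = re.compile(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)")

# Greedy: runs to the LAST full stop that is followed by whitespace.
greedy_match = greedy.search(text).group(1)

# Lazy: stops at the FIRST full stop that is followed by whitespace.
lazy_match = lazy.search(text).group(1)

print(repr(greedy_match))
print(repr(lazy_match))
```

Note that the lazy version stops at the first full stop followed by whitespace, so an abbreviation such as "Mr. " inside the paragraph would end the match early.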

Upvotes: 2
