codebreaker

Reputation: 813

Regular expression for items listed in plain English

This is sort of a contrived example, but I'm trying to get at a general principle here.

Given phrases written in English using this list-like form:

I have a cat
I have a cat and a dog
I have a cat, a dog, and a guinea pig
I have a cat, a dog, a guinea pig, and a snake

Can I use a regular expression to get all of the items, regardless of how many there are? Note that the items may contain multiple words.

Obviously if I have just one, then I can use I have a (.+), and if there are exactly two, I have a (.+) and a (.+) works.

But things get more complicated if I want to match more than one form. To extract the list items from the first two examples, I would expect this to work: I have a (.*)(?: and a (.*))? It does work on the first phrase, capturing cat and null, but on the second one it captures cat and a dog and null. Things only get worse when I try to match phrases in even more forms.

Is there any way I can use regexes for this purpose? It seems rather simple, and I don't understand why my regex that matches 2-item lists works, but the one that matches 1- or 2-item lists does not.
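
The behavior described above comes from the greedy (.*): it consumes the rest of the sentence, and the optional group is then satisfied by matching nothing. A minimal Python sketch of this follows; the `lazy` pattern is only one possible repair and is an assumption, not something proposed in the thread.

```python
import re

# The pattern from the question: the greedy (.*) swallows the rest of the
# sentence, so the optional group matches nothing and stays None.
greedy = re.compile(r'I have a (.*)(?: and a (.*))?')

# One possible repair (an assumption): make the first group lazy and add
# a trailing $ so the pattern has to account for the whole sentence.
lazy = re.compile(r'I have a (.*?)(?: and a (.*))?$')

for sentence in ('I have a cat', 'I have a cat and a dog'):
    print(greedy.match(sentence).groups(), lazy.match(sentence).groups())
```

This still covers only one or two items. A fixed set of capture groups cannot return an arbitrary number of items, which is why repeated matching or splitting is usually a better fit for lists of unknown length.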

Upvotes: 7

Views: 691

Answers (4)

jawee

Reputation: 271

Here is one Java implementation, using a positive lookahead in the regex:

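import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListItems {  // class name is arbitrary
    public static void main(String[] args) {
        String str0 = "I have a cat";
        String str1 = "I have a cat and a dog";
        String str2 = "I have a cat, a dog, and a guinea pig";
        String str3 = "I have a cat, a dog, a guinea pig, and a snake";

        // "a", whitespace, then as little text as possible up to a comma,
        // the end of the line, or "and" (checked by the lookahead, not consumed)
        String regexp = "(?m)\\ba\\s+.*?(?=(?:,|$|and))";

        Pattern pMod = Pattern.compile(regexp);
        Matcher mMod = pMod.matcher(str3);

        // find() returns one item per call until the input is exhausted
        while (mMod.find()) {
            System.out.println(mMod.group(0));
        }
    }
}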

For str3, the output is:

a cat
a dog
a guinea pig
a snake

If an item can start with 'a', 'an', or 'one', the regex becomes (?m)\\b(one|an|a)\\s+.*?(?=(?:,|$|and))

(?m) means to enable the MULTILINE flag when doing the parsing. In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.
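
For comparison, here is a rough Python sketch of the same lookahead idea. It is a variation, not the answer's exact pattern: the `\s*` and the `\band\b` inside the lookahead are my own additions, and `find_items` is a hypothetical helper name. As far as I can tell, the original lookahead leaves a trailing space on "a cat " in the two-item sentence and would stop early inside a word such as "sandwich"; the variation is meant to avoid both.

```python
import re

# Variation on the lookahead pattern (assumption, not the original regex):
# - an? accepts "a" or "an"
# - \s* in the lookahead keeps the space before "and" out of the match
# - \band\b only treats "and" as a delimiter when it is a whole word
ITEM_REGEX = re.compile(r'\ban?\s+.+?(?=\s*(?:,|$|\band\b))')

def find_items(sentence):
    """Return every 'a ...' / 'an ...' item found in the sentence."""
    return ITEM_REGEX.findall(sentence)

for sample in ('I have a cat',
               'I have a cat and a dog',
               'I have a cat, a dog, a guinea pig, and a snake',
               'I have a sandwich and a hat'):
    print(find_items(sample))
```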

Upvotes: 1

Casimir et Hippolyte

Reputation: 89574

What you can do is use the \G anchor with the find method:

(?:\G(?!\A)(?:,? and|,)|\bI have) an? ((?>[b-z]+|\Ba|a(?!nd\b))+(?> (?>[b-z]+|\Ba|a(?!nd\b))+)*)

or, more simply:

(?:\G(?!\A)(?:,? and|,)|\bI have) an? ((?!and\b)[a-z]+(?> (?!and\b)[a-z]+)*)

\G matches at the position in the string where the previous match ended. The pattern has two entry points. The first match uses the second entry point, \bI have; each later match uses the first entry point, which only allows results contiguous with the previous one.

Note: \G matches the position after the last match, but it also matches the start of the string. (?!\A) is there to rule out that case.
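
Python's built-in re module has no \G anchor, so the pattern above cannot be used there as written. A rough way to get the same "contiguous matches only" behavior is to chain Pattern.match calls, each starting where the previous one ended. The sketch below assumes the simpler pattern; the names HEAD, FIRST_ITEM, NEXT_ITEM and list_items are hypothetical.

```python
import re

# An item: " a " or " an ", then lowercase words, none of which is "and".
ITEM = r' an? ((?!and\b)[a-z]+(?: (?!and\b)[a-z]+)*)'

HEAD = re.compile(r'\bI have')                    # second entry point
FIRST_ITEM = re.compile(ITEM)                     # item right after "I have"
NEXT_ITEM = re.compile(r'(?:,? and|,)' + ITEM)    # separator, then an item

def list_items(sentence):
    """Collect items that follow 'I have' without any gap between them."""
    head = HEAD.search(sentence)
    if head is None:
        return []
    items = []
    # match(string, pos) only succeeds exactly at pos, which plays the
    # role of \G: every item must start where the previous one ended.
    m = FIRST_ITEM.match(sentence, head.end())
    while m:
        items.append(m.group(1))
        m = NEXT_ITEM.match(sentence, m.end())
    return items

print(list_items('I have a cat, a dog, a guinea pig, and a snake'))
```

As with the original pattern, matching stops at the first thing that is not a separator followed by an item, so text after the list is ignored.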

online demo

regex planet (click the Java button)

Upvotes: 1

Santa

Reputation: 11545

I use regex splitting to do it. But this assumes sentence format exactly matching your input set:

>>> import re
>>> SPLIT_REGEX = r', |I have|and|, and'
>>> for sample in ('I have a cat', 'I have a cat and a dog', 'I have a cat, a dog, and a guinea pig', 'I have a cat, a dog, a guinea pig, and a snake'):
...     print([x.strip() for x in re.split(SPLIT_REGEX, sample) if x.strip()])
... 
['a cat']
['a cat', 'a dog']
['a cat', 'a dog', 'a guinea pig']
['a cat', 'a dog', 'a guinea pig', 'a snake']
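
One caveat with splitting on a bare and: it also splits inside words such as "sandwich" or "band". The sketch below is a variation on the split pattern that uses word boundaries; the pattern and the split_items helper are assumptions of mine, not part of the answer.

```python
import re

# Delimiters: the "I have" prefix, ", and", a standalone "and", or a comma.
# \b keeps the "and" inside words like "sandwich" from acting as a delimiter.
SPLIT_REGEX = re.compile(r'I have|,\s*and\b|\band\b|,')

def split_items(sentence):
    """Split on the delimiters and drop empty or whitespace-only pieces."""
    return [part.strip() for part in SPLIT_REGEX.split(sentence) if part.strip()]

print(split_items('I have a sandwich, a hat, and a band'))
```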

Upvotes: 1

Nir Alfasi

Reputation: 53535

You can use a non-capturing group as a conditional delimiter (either a comma or end-of-line):
' a (.*?)(?:,|$)'

Example in python:

import re
line = 'I have a cat, a dog, a guinea pig, and a snake'
mat = re.findall(r' a (.*?)(?:,|$)', line)
print(mat)  # ['cat', 'dog', 'guinea pig', 'snake']
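
Note that with only a comma or end-of-line as the delimiter, the two-item form 'I have a cat and a dog' comes back as a single item, 'cat and a dog'. A possible extension, sketched here as an assumption rather than as part of the answer, is to let a standalone " and" end an item too, using a lookahead so the delimiter is not consumed. The extract_items name is hypothetical.

```python
import re

# " a " or " an ", then a lazy capture that stops before a comma,
# before a standalone " and", or at the end of the line.
ITEM_REGEX = re.compile(r' an? (.*?)(?=,| and\b|$)')

def extract_items(line):
    """Return the item names without their leading article."""
    return ITEM_REGEX.findall(line)

for line in ('I have a cat',
             'I have a cat and a dog',
             'I have a cat, a dog, a guinea pig, and a snake'):
    print(extract_items(line))
```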

Upvotes: 1
