Reputation: 813
This is sort of a contrived example, but I'm trying to get at a general principle here.
Given phrases written in English using this list-like form:
I have a cat
I have a cat and a dog
I have a cat, a dog, and a guinea pig
I have a cat, a dog, a guinea pig, and a snake
Can I use a regular expression to get all of the items, regardless of how many there are? Note that the items may contain multiple words.
Obviously, if I have just one item, I can use:
I have a (.+)
and if there are exactly two, this works:
I have a (.+) and a (.+)
But things get more complicated if I want to match more than just one example. If I want to extract the list items from the first two examples, I would think this would work:
I have a (.*)(?: and a (.*))?
And while this works on the first phrase, telling me "I have a cat" and null, for the second one it tells me "I have a cat and a dog" and null. Things only get worse when I try to match phrases in even more forms.
Is there any way I can use regexes for this purpose? It seems rather simple, and I don't understand why my regex that matches 2-item lists works, but the one that matches 1- or 2-item lists does not.
Upvotes: 7
Views: 691
Reputation: 271
Here is one Java implementation, using a positive lookahead regex:
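import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str0 = "I have a cat";
        String str1 = "I have a cat and a dog";
        String str2 = "I have a cat, a dog, and a guinea pig";
        String str3 = "I have a cat, a dog, a guinea pig, and a snake";
        // Lazily match from "a " up to the next comma, "and", or end of line.
        String regexp = "(?m)\\ba\\s+.*?(?=(?:,|$|and))";
        Pattern pMod = Pattern.compile(regexp);
        Matcher mMod = pMod.matcher(str3);
        while (mMod.find()) {
            System.out.println(mMod.group(0));
        }
    }
}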
For str3, the output is:
a cat
a dog
a guinea pig
a snake
If an item can start with 'a', 'an', or 'one', then the regex could be:
(?m)\\b(one|an|a)\\s+.*?(?=(?:,|$|and))
The (?m) prefix enables the MULTILINE flag for the pattern.
In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.
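To check the 'a' / 'an' / 'one' variant against all four sample sentences, here is a minimal sketch. The class name LookaheadItems and the helper items are made up for illustration, the alternation is written as a non-capturing group, and each match is trimmed because the lazy match can end with a space just before "and".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the answer above.
class LookaheadItems {
    // Article, whitespace, then a lazy run that stops before ",", end of line, or "and".
    private static final Pattern ITEM =
            Pattern.compile("(?m)\\b(?:one|an|a)\\s+.*?(?=(?:,|$|and))");

    // Collects every match in the sentence, with surrounding whitespace trimmed.
    static List<String> items(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = ITEM.matcher(sentence);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {
            "I have a cat",
            "I have a cat and a dog",
            "I have a cat, a dog, and a guinea pig",
            "I have a cat, a dog, a guinea pig, and a snake"
        };
        for (String sample : samples) {
            System.out.println(items(sample));
        }
    }
}
```

One caveat with this pattern: the "and" inside the lookahead has no word boundary, so an item that contains the letters "and" is cut short. For example, "I have a sandwich" yields "a s". Writing the alternative as \\band\\b should avoid that.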
Upvotes: 1
Reputation: 89574
What you can do is use the \G anchor with the find method:
(?:\G(?!\A)(?:,? and|,)|\bI have) an? ((?>[b-z]+|\Ba|a(?!nd\b))+(?> (?>[b-z]+|\Ba|a(?!nd\b))+)*)
or, more simply:
(?:\G(?!\A)(?:,? and|,)|\bI have) an? ((?!and\b)[a-z]+(?> (?!and\b)[a-z]+)*)
\G matches the position in the string where the last match ended. The pattern has two entry points. The first match uses the second entry point, \bI have, and each later match uses the first entry point, which only allows results contiguous with the previous match.
Note: \G also matches at the start of the string, not only after the last match. The (?!\A) lookahead is there to rule that case out.
regex planet (click the Java button)
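For reference, a minimal sketch of how the simpler pattern could be driven with find in Java. The class name GAnchorItems and the helper items are made up for illustration. Only the pattern itself comes from the answer above, split across two string literals for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the answer above.
class GAnchorItems {
    // Entry point 1: \G (but not at the start of input), then ", and", " and", or ",".
    // Entry point 2: the literal "I have".
    // Group 1: the item words, none of which may be the word "and".
    private static final Pattern ITEM = Pattern.compile(
            "(?:\\G(?!\\A)(?:,? and|,)|\\bI have) an? "
            + "((?!and\\b)[a-z]+(?> (?!and\\b)[a-z]+)*)");

    // Each find() call resumes where the previous match ended, which is what \G needs.
    static List<String> items(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = ITEM.matcher(sentence);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(items("I have a cat"));
        System.out.println(items("I have a cat and a dog"));
        System.out.println(items("I have a cat, a dog, and a guinea pig"));
        System.out.println(items("I have a cat, a dog, a guinea pig, and a snake"));
    }
}
```

Because later matches must start exactly where the previous one ended, anything after a break in the list is ignored. With "I have a cat. You have a dog", only "cat" is returned.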
Upvotes: 1
Reputation: 11545
I would use regex splitting for this. Note that it assumes the sentence format matches your input set exactly:
>>> import re
>>> SPLIT_REGEX = r', |I have|and|, and'
>>> for sample in ('I have a cat', 'I have a cat and a dog', 'I have a cat, a dog, and a guinea pig', 'I have a cat, a dog, a guinea pig, and a snake'):
...     print([x.strip() for x in re.split(SPLIT_REGEX, sample) if x.strip()])
...
['a cat']
['a cat', 'a dog']
['a cat', 'a dog', 'a guinea pig']
['a cat', 'a dog', 'a guinea pig', 'a snake']
Upvotes: 1
Reputation: 53535
You can use a non-capturing group as a conditional delimiter (either a comma or end-of-line):
' a (.*?)(?:,|$)'
Example in Python:
import re
line = 'I have a cat, a dog, a guinea pig, and a snake'
mat = re.findall(r' a (.*?)(?:,|$)', line)
print(mat)  # ['cat', 'dog', 'guinea pig', 'snake']
Upvotes: 1