Reputation: 294488
I have a list of strings
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
I want to extract the date and the person's name from each string.
I could do:
import re
pattern = r'.*(\d{4}-\d{2}-\d{2}).*with \b([^\b]+)\b.*'
matched = [re.match(pattern, x).groups() for x in my_strings]
but it fails because the pattern doesn't match "with Tom on 2015-06-30": re.match returns None for that string, so calling .groups() on it raises an AttributeError.
How do I specify the regex pattern so that it is indifferent to the order in which the date and the person appear in the string?
And how do I ensure that the groups() method returns them in the same order every time?
I expect the output to look like this:
[('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
Upvotes: 4
Views: 562
Reputation:
If you use Python's newer regex module (a third-party package, not the built-in re), you can use conditionals to guarantee a match on both items.
I think this is closer to the standard way of doing out-of-order matching.
(?:.*?(?:(?(1)(?!))\b(\d{4}-\d\d-\d\d)\b|(?(2)(?!))with[ ](\w+))){2}
Expanded:
(?:
    .*?
    (?:
        (?(1)(?!))
        \b
        ( \d{4} - \d\d - \d\d )    # (1)
        \b
      | (?(2)(?!))
        with [ ]
        ( \w+ )                    # (2)
    )
){2}
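For comparison, here is a minimal sketch of a different technique that needs only the standard-library re module: two lookaheads anchored at the start of the string. This is not the conditional pattern above, just an alternative under the assumption that each string contains exactly one date and one "with <name>" phrase. Because the capture groups are numbered by their position in the pattern, groups() always returns the date first and the person second, whatever order they appear in the text.

```python
import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# Each lookahead scans the whole string from position 0 without consuming
# anything, so the date and the name can appear in either order.
# Group 1 is always the date, group 2 is always the person.
pattern = re.compile(
    r"^(?=.*?\b(\d{4}-\d\d-\d\d)\b)"   # (1) date, anywhere in the string
    r"(?=.*?\bwith (\w+))"             # (2) word following "with "
)

# Assumes every string matches; pattern.match() returns None otherwise.
matched = [pattern.match(s).groups() for s in my_strings]
print(matched)
```

If a string might lack either item, check the result of pattern.match() for None before calling groups().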
Upvotes: 2
Reputation: 474041
Just for educational purposes, a non-regex approach could use the dateutil
parser in "fuzzy" mode to extract the dates and the nltk
toolkit's named entity recognition to extract the names. Complete code:
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from dateutil.parser import parse
def extract_names(text):
    tokenizer = SpaceTokenizer()
    toks = tokenizer.tokenize(text)
    pos = pos_tag(toks)
    chunked_nes = ne_chunk(pos)
    # Named entities come back as subtrees; join the words of each one.
    return [' '.join(leaf[0] for leaf in ne.leaves())
            for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30"
]
for s in my_strings:
    print(parse(s, fuzzy=True))
    print(extract_names(s))
Prints:
2002-03-04 00:00:00
['Matt']
2016-01-23 00:00:00
['Mary']
2015-06-30 00:00:00
['Tom']
That's probably an over-complication though.
Upvotes: 2
Reputation: 15433
What about doing it with two separate regexes?
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
import re
pattern = r'.*(\d{4}-\d{2}-\d{2})'
dates = [re.match(pattern, x).groups()[0] for x in my_strings]
pattern = r'.*with (\w+).*'
persons = [re.match(pattern, x).groups()[0] for x in my_strings]
output = list(zip(dates, persons))  # list() so it also prints as a list on Python 3
print(output)
## [('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
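A slightly more defensive variant of the same two-regex idea might look like the sketch below. It uses re.search instead of re.match with a leading `.*`, and returns None for an item that is missing rather than raising an AttributeError. The names DATE_RE, PERSON_RE and extract are made up for this illustration and are not part of the original answer.

```python
import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# Hypothetical helper names, chosen only for this sketch.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
PERSON_RE = re.compile(r"\bwith (\w+)")


def extract(text):
    """Return (date, person); an item is None when it is not found."""
    date = DATE_RE.search(text)      # search() scans the whole string
    person = PERSON_RE.search(text)
    return (
        date.group(1) if date else None,
        person.group(1) if person else None,
    )


output = [extract(s) for s in my_strings]
print(output)
```

Since each item has its own pattern, the order in the tuple is fixed by the function rather than by the text.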
Upvotes: 4
Reputation: 15310
This should work:
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
import re
alternates = r"(?:\b(\d{4}-\d\d-\d\d)\b|with (\w+)|.)*"
for tc in my_strings:
    print(tc)
    m = re.match(alternates, tc)
    if m:
        print("\t", m.group(1))
        print("\t", m.group(2))
Output is:
$ python test.py
2002-03-04 with Matt
	 2002-03-04
	 Matt
Important: 2016-01-23 with Mary
	 2016-01-23
	 Mary
with Tom on 2015-06-30
	 2015-06-30
	 Tom
However, something like this is not totally intuitive. I encourage you to try using named groups if at all possible.
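As a rough sketch of what the named-group version could look like, the same alternation can label its two captures so the caller asks for them by name instead of by position. The group names date and person and the results list are choices made for this example.

```python
import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# Same alternation as above, but the captures are named.
# The trailing "." alternative consumes any character that is not part
# of a date or a "with <name>" phrase.
alternates = re.compile(
    r"(?:\b(?P<date>\d{4}-\d\d-\d\d)\b|with (?P<person>\w+)|.)*"
)

results = []
for tc in my_strings:
    m = alternates.match(tc)
    # A group that never participated in the match is returned as None.
    results.append((m.group("date"), m.group("person")))

print(results)
```

m.groupdict() returns the same values as a dictionary keyed by group name, which can be easier to pass around than a tuple.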
Upvotes: 2