Reputation: 13
I have a list of strings containing dates, country, and city:
myList = ["(1922, May, 22; USA; CHICAGO)","(1934, June, 15; USA; BOSTON)"]
I want to extract only the date and the city (city names are always in capital letters). So far I have this:
import re

for info in myList:
    pattern_i = re.compile(r"[^;]+")
    pattern_f = re.compile(r";\s\b([A-Z]+)\)")
    mi = re.match(pattern_i, info)
    mf = re.match(pattern_f, info)
    print(mi)
    print(mf)
I am getting:
<re.Match object; span=(0, 14), match='(1922, May, 22'>
None
<re.Match object; span=(0, 15), match='(1934, June, 15'>
None
I've tried so many things and can't seem to find a solution. What am I missing here?
Upvotes: 0
Views: 879
Reputation: 37755
Thanks! But I am still curious: why am I getting None for mf?
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
Ref: Docs
re.match looks for a match only at the beginning of the string. The pattern you're trying to match isn't at the start of the string, so you get None. re.search is one option for finding a match anywhere in the string.
As I suggested, split is a better option here: split on ; and take the first and last elements to get the desired output.
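Here is a minimal sketch of both options, using the sample list from the question. The names `my_list`, `city_with_search` and `date_and_city_with_split` are made up for illustration, and the split version assumes every entry is wrapped in parentheses and uses ; as the separator.

```python
import re

my_list = ["(1922, May, 22; USA; CHICAGO)", "(1934, June, 15; USA; BOSTON)"]

# Same city pattern as in the question
city_pattern = re.compile(r";\s\b([A-Z]+)\)")


def city_with_search(entry):
    """Return the city using search(), which scans the whole string."""
    m = city_pattern.search(entry)
    return m.group(1) if m else None


def date_and_city_with_split(entry):
    """Drop the parentheses, split on ';', keep the first and last fields."""
    parts = entry.strip("()").split(";")
    return parts[0].strip(), parts[-1].strip()


for entry in my_list:
    # match() only tries position 0, where the string starts with "(", not ";"
    print(city_pattern.match(entry))
    print(city_with_search(entry))
    print(date_and_city_with_split(entry))
```

Note that search() first tries the ; before USA, fails because no closing parenthesis follows, and then succeeds at the second ;.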
Upvotes: 0
Reputation: 360
Regex is overkill for data with simple, consistent formatting. This can be done easily using the built-in string methods.
for entry in myList:
    date, country, city = [x.strip() for x in entry[1:-1].split(';')]

# Explanation
entry[1:-1]             # Strip off the parentheses
entry[1:-1].split(';')  # Split into a list of strings on the ';' character
x.strip()               # Strip extra whitespace
Upvotes: 1
Reputation: 150735
You can use pandas:
import pandas as pd
p = r'\((?P<date>.*);.*;(?P<city>.*)\)'
pd.Series(myList).str.extract(p)
Output:
             date      city
0   1922, May, 22   CHICAGO
1  1934, June, 15    BOSTON
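If pandas is not available, the same named-group idea can be sketched with the standard library alone. This is an adaptation rather than the pattern above: `[^;]*` replaces `.*` so that each field stops at a semicolon, and `\s*` is added so the captured city has no leading space. The names `my_list`, `pattern` and `rows` are illustrative.

```python
import re

my_list = ["(1922, May, 22; USA; CHICAGO)", "(1934, June, 15; USA; BOSTON)"]

# Named groups for date and city; the middle field (country) is skipped.
# \s* swallows the space after the second ';' so it is not part of the city.
pattern = re.compile(r"\((?P<date>[^;]*);[^;]*;\s*(?P<city>[^;]*)\)")

# search() returns None for entries that do not fit, so those are filtered out
rows = [m.groupdict() for m in map(pattern.search, my_list) if m]

for row in rows:
    print(row["date"], "|", row["city"])
```

Each element of `rows` is a plain dict with the keys `date` and `city`, which can be passed to `pd.DataFrame(rows)` later if a table is needed.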
Upvotes: 0