Reputation: 1970
I have a large set of data in a list. The list consists of short strings. Hidden inside the list are slices of length 5 that match a certain pattern:
[<date>, <date>, <4 digit integer>, <string>, <$ amount>]
How can I extract these slices from my data set? They can occur at any position (so their index is not guaranteed to be a multiple of 5) and are interspersed with other data (also strings) that can match parts of the pattern.
I started with something similar to:
for item in data:
    if re.search(<date pattern>, item):
        if not date1:
            date1 = item
        else:
            date2 = item
    if re.search(<4 digit integer pattern>, item):
        if date1 and date2 and not fourdigit:
            fourdigit = item
        else:
            date1 = None
            date2 = None
    ...
But this is very complicated, prone to errors and not pythonic at all.
The next approach was to extract a sliding window of 5 items from the list of data and check that all items match their pattern. If not, increment the index by 1 (i.e. slide the window by 1) and check the next slice. If the pattern matches, save the slice, and increment the index by 5. Something like:
index = 0
while index <= len(data) - 5:   # <=, otherwise the final slice is never checked
    sliceof5 = data[index:index+5]
    if slice_matches_pattern(sliceof5):
        matching_items.append(sliceof5)
        index += 5
    else:
        index += 1
This works, is a lot easier to implement, and is less error-prone than the previous solution, but it doesn't seem very pythonic either.
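For reference, here is a runnable version of that sliding-window approach. The concrete patterns (ISO dates, an alphabetic string, a `$` amount) are only stand-ins for the question's `<date pattern>` etc., and the helper names are assumptions:

```python
import re

# Placeholder patterns, one per position in the 5-item slice.
PATTERNS = [
    re.compile(r'\d{4}-\d{2}-\d{2}$'),  # date
    re.compile(r'\d{4}-\d{2}-\d{2}$'),  # date
    re.compile(r'\d{4}$'),              # 4-digit integer
    re.compile(r'[A-Za-z]+$'),          # string
    re.compile(r'\$[\d,.]+$'),          # $ amount
]

def slice_matches_pattern(sliceof5):
    """True if all five items match their positional pattern."""
    return all(p.match(item) for p, item in zip(PATTERNS, sliceof5))

def find_slices(data):
    matching_items = []
    index = 0
    while index <= len(data) - 5:
        sliceof5 = data[index:index + 5]
        if slice_matches_pattern(sliceof5):
            matching_items.append(sliceof5)
            index += 5   # jump past the consumed slice
        else:
            index += 1   # slide the window by one
    return matching_items
```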
Is it maybe possible to do this using list comprehension? Something like:
matching_items = [sliceof5 for sliceof5 in data if slice_matches_pattern(sliceof5)]
But then, how do I make the for in the list comprehension sometimes skip forward by 1 and sometimes by 5?
Are there maybe other, pythonic ways to achieve this?
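For what it's worth, the closest a comprehension can get is to iterate over start indices rather than items (assuming a `slice_matches_pattern()` predicate like the one above). Note that it checks every offset, so unlike the while loop it cannot skip ahead by 5 after a hit, and overlapping matches would all be reported:

```python
import re

# Hypothetical patterns standing in for the question's <date pattern> etc.
PATTERNS = [re.compile(p) for p in (
    r'\d{4}-\d{2}-\d{2}$',  # date
    r'\d{4}-\d{2}-\d{2}$',  # date
    r'\d{4}$',              # 4-digit integer
    r'[A-Za-z]+$',          # string
    r'\$[\d,.]+$',          # $ amount
)]

def slice_matches_pattern(sliceof5):
    return all(p.match(item) for p, item in zip(PATTERNS, sliceof5))

data = ['noise', '2016-12-01', '2016-12-02', '1234', 'spam', '$100']

# Comprehension over all possible start indices of a 5-item window.
matching_items = [data[i:i + 5]
                  for i in range(len(data) - 4)
                  if slice_matches_pattern(data[i:i + 5])]
```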
Upvotes: 2
Views: 188
Reputation: 1716
Having no idea what your data actually looks like, here's a regex approach following @hpaulj's idea: join the list into one string with a separator and match the whole pattern at once.
import re

data = [
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '2016-12-02', 'spam',                                # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '1234', '2016-12-01',                                # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '$100', '1234', '1234',                              # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100'   # collect
]

pattern_sep_str = '||'  # change to something that does not occur in the data
pattern_sep = re.escape(pattern_sep_str)

date_pattern = r'[0-9]{4}-[0-9]{2}-[0-9]{2}'
int_pattern = r'[0-9]{4}'
str_pattern = r'[a-zA-Z]+'
amount_pattern = r'\$[0-9,.]+'

pattern_combined = ''.join([
    '(', date_pattern, pattern_sep, date_pattern, pattern_sep,
    int_pattern, pattern_sep, str_pattern, pattern_sep,
    amount_pattern, ')'
])

results = re.findall(pattern_combined, pattern_sep_str.join(data))
print([x.split(pattern_sep_str) for x in results])
>>> [['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100']]
Upvotes: 2
Reputation: 39023
Your second solution seems fine. I would change it to a generator (yield the slices you find), but nothing more.
You can probably make it run faster, though, by first checking whether the second item of the slice is a date. Since any match must start with two dates, if that item is not a date no match can begin at the current index or the next one, so you can add two to the index.
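Both suggestions combined might look like this sketch; the patterns are placeholders, not the asker's real ones:

```python
import re

DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}$')

# Placeholder patterns, one per position in the 5-item slice.
PATTERNS = [DATE_RE, DATE_RE,
            re.compile(r'\d{4}$'),      # 4-digit integer
            re.compile(r'[A-Za-z]+$'),  # string
            re.compile(r'\$[\d,.]+$')]  # $ amount

def slice_matches_pattern(sliceof5):
    return all(p.match(item) for p, item in zip(PATTERNS, sliceof5))

def iter_slices(data):
    """Yield matching 5-item slices instead of collecting them in a list."""
    index = 0
    while index <= len(data) - 5:
        # Cheap pre-check: every match starts with two dates, so if the
        # second item is not a date, no match can begin here or one step
        # ahead, and the window can safely advance by two.
        if not DATE_RE.match(data[index + 1]):
            index += 2
            continue
        sliceof5 = data[index:index + 5]
        if slice_matches_pattern(sliceof5):
            yield sliceof5
            index += 5
        else:
            index += 1
```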
Of course, if you can turn everything into one big regular expression that matches your entire pattern, you'll do even better.
Upvotes: 2