NZD

Reputation: 1970

Python2.7: Extract slice from a list based on a pattern in a Pythonic way

I have a large set of data in a list. The list consists of short strings. Hidden inside the list are slices of length 5 that match a certain pattern:

[<date>, <date>, <4 digit integer>, <string>, <$ amount>]

How can I extract these slices from my data set? They can occur at any location (so their index is not guaranteed to be a multiple of 5) and are interspersed with other data (also strings) that can match part of the pattern.

I started with something similar to:

for item in data:
    if re.search(<date pattern>, item):
        if not date1:
            date1 = item
        else:
            date2 = item
    if re.search(<4 digit integer pattern>, item):
        if date1 and date2 and not fourdigit:
            fourdigit = item
        else:
            date1 = None
            date2 = None
    ....

But this is very complicated, prone to errors and not pythonic at all.

The next approach was to extract a sliding window of 5 items from the list of data and check that all items match their pattern. If not, increment the index by 1 (i.e. slide the window by 1) and check the next slice. If the pattern matches, save the slice, and increment the index by 5. Something like:

index = 0
while index <= len(data) - 5:  # <= so the final window is checked too
    sliceof5 = data[index:index+5]
    if slice_matches_pattern(sliceof5):
        matching_items.append(sliceof5)
        index += 5
    else:
        index += 1

This works, and it is a lot easier to implement and less error-prone than the previous solution, but it doesn't seem very pythonic either.

Is it maybe possible to do this using list comprehension? Something like:

matching_items = [sliceof5 for sliceof5 in data if slice_matches_pattern(sliceof5)]

But then, how do I make the for in the list comprehension sometimes skip forward by 1 and sometimes by 5?

Are there maybe other, pythonic ways to achieve this?

Upvotes: 2

Views: 188

Answers (2)

double_j

Reputation: 1716

Having no idea what your data looks like, following @hpaulj's idea, here's a regex approach.

import re

data = [
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '2016-12-02', 'spam',  # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '1234', '2016-12-01',  # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '$100', '1234', '1234',  # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100'  # collect
]

pattern_sep_str = '||'  # change to something unique in the data

pattern_sep = re.escape(pattern_sep_str)
date_pattern = r'[0-9]{4}-[0-9]{2}-[0-9]{2}'
int_pattern = r'[0-9]{4}'
str_pattern = r'[a-zA-Z]+'
amount_pattern = r'\$[0-9,.]+'

pattern_combined = ''.join([
    '(', date_pattern, pattern_sep, date_pattern, pattern_sep,
    int_pattern, pattern_sep, str_pattern, pattern_sep,
    amount_pattern, ')'
])

results = re.findall(pattern_combined, pattern_sep_str.join(data))

print([x.split(pattern_sep_str) for x in results])

>>> [['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100']]
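
One caveat with the joined-string approach: the combined pattern is not tied to item boundaries at its two ends, so a slice can match even when its first date is only the tail of an item or its amount is only the head of one. The inner elements are already bounded by the separators on both sides. A minimal sketch of one way to guard against that with lookarounds, assuming the separator never occurs in the data; `find_bounded` is a hypothetical name, not part of the answer above:

```python
import re

SEP = '||'  # assumption: this string never occurs inside any item
sep = re.escape(SEP)
date = r'[0-9]{4}-[0-9]{2}-[0-9]{2}'

# The five element patterns, joined by the escaped separator.
body = sep.join([date, date, r'[0-9]{4}', r'[a-zA-Z]+', r'\$[0-9,.]+'])

# A match must start at the beginning of the string or right after a
# separator, and must end right before a separator or at the end.
bounded = re.compile(
    r'(?:\A|(?<=' + sep + r'))(' + body + r')(?=' + sep + r'|\Z)'
)


def find_bounded(data):
    """Return every matching slice as a list of its five items."""
    return [m.split(SEP) for m in bounded.findall(SEP.join(data))]
```

With this guard, a list such as `['x2016-12-01', '2016-12-02', '1234', 'spam', '$100abc']` produces no match, whereas the unbounded pattern would pick out a slice from the middle of it.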

Upvotes: 2

zmbq

Reputation: 39023

Your second solution seems fine. I would change it to a generator (yield the slices you find), but nothing more.
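
A minimal sketch of what that generator might look like. The patterns in `PATTERNS` and the body of `slice_matches_pattern` are assumptions, since the question does not show the real ones, and `find_slices` is a hypothetical name:

```python
import re

# Assumed patterns, one per position in the slice. \Z forces a whole-item match.
PATTERNS = tuple(re.compile(p + r'\Z') for p in (
    r'[0-9]{4}-[0-9]{2}-[0-9]{2}',  # date
    r'[0-9]{4}-[0-9]{2}-[0-9]{2}',  # date
    r'[0-9]{4}',                    # 4 digit integer
    r'[a-zA-Z]+',                   # string
    r'\$[0-9,.]+',                  # $ amount
))


def slice_matches_pattern(sliceof5):
    """Return True if the slice has 5 items and each matches its position."""
    return (len(sliceof5) == len(PATTERNS) and
            all(p.match(item) for p, item in zip(PATTERNS, sliceof5)))


def find_slices(data, size=5):
    """Yield each non-overlapping matching slice, scanning left to right."""
    index = 0
    while index <= len(data) - size:
        sliceof5 = data[index:index + size]
        if slice_matches_pattern(sliceof5):
            yield sliceof5
            index += size  # skip past the slice just found
        else:
            index += 1     # slide the window by one item
```

The caller can then write `matching_items = list(find_slices(data))`, or iterate over the generator directly to avoid building the full list.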

You can probably make it run faster, though, by checking the second item of the slice first. If it's not a date, neither the current window nor the next one can match (both need a date in that position), so you can add two to the index.
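
A rough sketch of that skip, under the same caveat that the patterns here are assumed rather than taken from the question; `find_slices_fast` is a hypothetical name:

```python
import re

# Assumed patterns; \Z forces a whole-item match.
DATE = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}\Z')
REST = tuple(re.compile(p + r'\Z') for p in (
    r'[0-9]{4}',    # 4 digit integer
    r'[a-zA-Z]+',   # string
    r'\$[0-9,.]+',  # $ amount
))


def find_slices_fast(data):
    """Yield matching slices, skipping two items when the second is no date."""
    index = 0
    while index <= len(data) - 5:
        if not DATE.match(data[index + 1]):
            # A window starting at index needs a date at index + 1, and so
            # does a window starting at index + 1, so both can be skipped.
            index += 2
        elif (DATE.match(data[index]) and
              all(p.match(item)
                  for p, item in zip(REST, data[index + 2:index + 5]))):
            yield data[index:index + 5]
            index += 5
        else:
            index += 1
```

Whether this is measurably faster depends on how often the second item fails the date check, so it is worth timing on the real data before adopting it.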

Of course, if you can turn everything into one big regular expression that matches your entire pattern, you'll do even better.

Upvotes: 2
