Siddhant Sahni

Reputation: 157

Pattern to not allow certain file names

I'm reading files from a directory whose names start with 'buy_train_' and end with 0.

data = [pd.read_parquet(f,engine='fastparquet') for f in glob.glob(local_data_path+'buy_train_20**-**-**_****_0')]

I need to update this expression so that it doesn't include files whose names start with 'buy_train_2018', i.e., if the file name has 2018 in it, skip it. I still need to filter files based on the first expression; this is just an additional filter. Can anyone help me with this? I tried an expression like the one below:

buy_train_20**(?![8])-**-**_**_0

That should have filtered out anything which ends with 8, but it failed to do so.

Any help is appreciated.

Edit1:

Examples-

buy_train_2020-01-08_001221_0

buy_train_2020-05-02_067341_0

buy_train_2020-07-26_011901_0

-> The above examples will be acceptable file names as they don't have 2018 after 'buy_train'.

buy_train_2018-10-16_617901_0

buy_train_2018-12-19_492111_0

-> The above examples should be filtered out as they have 2018 after 'buy_train'

Upvotes: 1

Views: 273

Answers (1)

Laurent B.

Reputation: 2263

First method with regex

(?!...)

Matches if ... doesn’t match next. This is a negative lookahead assertion.


import re

# Raw string so that \d is passed to the regex engine unchanged.
regex_filter = r'buy_train_20(?!18)\d*-\d*-\d*_\d*_0'

expr1 = 'buy_train_2018-10-16_617901_0'
m = re.search(regex_filter, expr1)
print(m)
# None
# (m is None here, so do not call m.group() on it)

expr2 = 'buy_train_2020-01-08_001221_0'
m = re.search(regex_filter, expr2)
print(m)
print(m.group(0))
# <_sre.SRE_Match object; span=(0, 29), match='buy_train_2020-01-08_001221_0'>
# buy_train_2020-01-08_001221_0
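
To connect this to the glob call in the question: glob patterns are shell-style wildcards, not regular expressions, so a lookahead such as (?!...) has no effect there. One option is to let glob do a coarse selection and then apply the regex to each file name. The following is only a minimal sketch under some assumptions: `local_data_path` is taken to be a directory, the helper names `keep` and `list_training_files` are hypothetical, and the fixed digit counts (`\d{2}`) are a tightening based on the example file names rather than something stated in the question.

```python
import glob
import os
import re

# Same idea as regex_filter above, compiled once and matched against the whole name.
PATTERN = re.compile(r'buy_train_20(?!18)\d{2}-\d{2}-\d{2}_\d+_0')


def keep(path):
    """Return True if the file name matches the pattern and is not from 2018."""
    # glob returns paths that include the directory, so test only the base name.
    return PATTERN.fullmatch(os.path.basename(path)) is not None


def list_training_files(local_data_path):
    """Coarse selection with glob, then the year exclusion with the regex."""
    candidates = glob.glob(os.path.join(local_data_path, 'buy_train_20*_0'))
    return sorted(p for p in candidates if keep(p))
```

The resulting list could then be passed to the same list comprehension as in the question, for example `[pd.read_parquet(f, engine='fastparquet') for f in list_training_files(local_data_path)]`.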

Second method with the built-in filter function:

You don't necessarily need a regex for filtering; you can use, for instance, the built-in filter function as follows:

paths = ['buy_train_2020-01-08_001221_0',
         'buy_train_2020-05-02_067341_0',
         'buy_train_2020-07-26_011901_0',
         'buy_train_2018-10-16_617901_0',
         'buy_train_2018-12-19_492111_0']


prefix = 'buy_train_2018'

def function(path):
    # Keep only the paths that do not start with the excluded prefix.
    return not path.startswith(prefix)

results = filter(function, paths)

for res in results:
    print(res)
    
# buy_train_2020-01-08_001221_0
# buy_train_2020-05-02_067341_0
# buy_train_2020-07-26_011901_0
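
One caveat when applying the prefix check to the output of glob.glob: the returned paths include the directory part, so a check against the start of the full path would not match. A possible variant, sketched here with a hypothetical helper name `drop_prefixed`, compares the prefix against the base name only:

```python
import os


def drop_prefixed(paths, prefix='buy_train_2018'):
    """Return the paths whose file name does not start with the given prefix."""
    return [p for p in paths if not os.path.basename(p).startswith(prefix)]


# Example with directory-qualified paths (the directory name is made up).
example = ['/data/buy_train_2020-01-08_001221_0',
           '/data/buy_train_2018-10-16_617901_0']
print(drop_prefixed(example))
# ['/data/buy_train_2020-01-08_001221_0']
```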

Upvotes: 1
