Reputation: 157
I'm reading files from a directory whose names start with 'buy_train_' and end with 0.
data = [pd.read_parquet(f,engine='fastparquet') for f in glob.glob(local_data_path+'buy_train_20**-**-**_****_0')]
I need to update this expression so that it excludes files whose names start with 'buy_train_2018', i.e., if the file name has 2018 in it, skip it. The files must still match the first expression; this is just an additional filter on top of it. Can anyone help me with this? I tried an expression like the one below:
buy_train_20**(?![8])-**-**_**_0
That should have filtered out anything which ends with 8, but it failed to do so.
Any help is appreciated.
Edit1:
Examples-
buy_train_2020-01-08_001221_0
buy_train_2020-05-02_067341_0
buy_train_2020-07-26_011901_0
-> The above examples will be acceptable file names as they don't have 2018 after 'buy_train'.
buy_train_2018-10-16_617901_0
buy_train_2018-12-19_492111_0
-> The above examples should be filtered out as they have 2018 after 'buy_train'
Upvotes: 1
Views: 273
Reputation: 2263
First method with regex
(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion.
import re
# Raw string, so that \d reaches the regex engine instead of being read as a string escape
regex_filter = r'buy_train_20(?!18)\d*-\d*-\d*_\d*_0'
expr1 = 'buy_train_2018-10-16_617901_0'
m = re.search(regex_filter, expr1)
print(m)
# None
# (m is None here, so do not call m.group() on it)
expr2 = 'buy_train_2020-01-08_001221_0'
m = re.search(regex_filter, expr2)
print(m)
print(m.group(0))
# <_sre.SRE_Match object; span=(0, 29), match='buy_train_2020-01-08_001221_0'>
# buy_train_2020-01-08_001221_0
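To tie this back to the question: glob patterns are not regular expressions, so glob.glob does not interpret (?!...) as a lookahead. One possible approach, as a minimal sketch, is to let glob do the coarse match and then apply the regex to each file name. The helper name keep_non_2018, the 'data/' directory, and the stricter \d{2} / \d+ quantifiers are my assumptions for illustration, not part of the original code.

```python
import os
import re

# Assumed file-name layout: buy_train_YYYY-MM-DD_<digits>_0,
# accepting any year 20xx except 2018.
PATTERN = re.compile(r'buy_train_20(?!18)\d{2}-\d{2}-\d{2}_\d+_0')

def keep_non_2018(paths):
    """Return the paths whose file name fully matches PATTERN."""
    return [p for p in paths if PATTERN.fullmatch(os.path.basename(p))]

# Hypothetical paths, standing in for the result of glob.glob(...)
candidates = [
    'data/buy_train_2020-01-08_001221_0',
    'data/buy_train_2018-10-16_617901_0',
]
print(keep_non_2018(candidates))
```

In the question's setting, the input would presumably be something like glob.glob(local_data_path + 'buy_train_20*_0'), and each surviving path could then be passed to pd.read_parquet.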
Second method with the built-in filter function:
You don't necessarily need a regex for this; you can, for instance, use the built-in filter function as follows:
paths = ['buy_train_2020-01-08_001221_0',
'buy_train_2020-05-02_067341_0',
'buy_train_2020-07-26_011901_0',
'buy_train_2018-10-16_617901_0',
'buy_train_2018-12-19_492111_0']
prefix = 'buy_train_2018'
def not_from_2018(path):
    # Keep only the paths that do not start with the excluded prefix
    return not path.startswith(prefix)
results = filter(not_from_2018, paths)
for res in results:
    print(res)
# buy_train_2020-01-08_001221_0
# buy_train_2020-05-02_067341_0
# buy_train_2020-07-26_011901_0
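The same idea can also be written without a helper function. As a rough sketch: fnmatch.fnmatch applies glob-style wildcards to a plain string, so it can stand in for the original glob pattern, and str.startswith handles the exclusion. The names wanted, excluded_prefix, and the extra 'sell_train_...' entry are illustrative assumptions.

```python
import fnmatch

names = [
    'buy_train_2020-01-08_001221_0',
    'buy_train_2018-10-16_617901_0',
    'sell_train_2020-01-08_001221_0',  # hypothetical non-matching name
]

wanted = 'buy_train_20*_0'          # glob-style pattern: what to include
excluded_prefix = 'buy_train_2018'  # what to skip

# Keep names that match the wildcard pattern and do not start with the excluded prefix
selected = [
    n for n in names
    if fnmatch.fnmatch(n, wanted) and not n.startswith(excluded_prefix)
]
print(selected)
```

Keeping the include rule and the exclude rule separate makes it easy to change either one later without touching the other.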
Upvotes: 1