Reputation: 1253

Remove Dates from a file name before the extension

I have some filenames in a list that have different extensions.

file_name_list = ['ABDCD Pattern Raw Data 1.4.2016.xlsx',
 'Jack Raw Data 1.2.2016.xlsx',
 'Farmers holdings 1.1.2016.xlsx',
 'Anne Raw Data 1.3.2016.csv',
 '120 Brewers 5-2-2018.txt']

I want to remove only the dates from these file names and add the results to a new list, like this:

['abdcd pattern raw data.xlsx',
 'jack raw data.xlsx',
 'farmers holdings.xlsx',
 'anne raw data.csv',
 '120 brewers.txt']

I tried the following, based on this post. It removed the numbers, but not in the way I want.

import re
OutputList = []
for i in file_name_list:
    lower_character = i.lower()
    OutputList.append(re.sub('[0-9.-]', '', lower_character))

Output:

['abdcd pattern raw data xlsx',
 'jack raw data xlsx',
 'farmers holdings xlsx',
 'anne raw data csv',
 ' brewers txt']

If you look closely, it also removed the 120 from 120 Brewers. How can I achieve what I want? I am using Python 3. Any suggestions would be appreciated.

Upvotes: 4

Answers (4)

Kasravnd

Reputation: 107287

If you also want to keep the dates, use re.split() instead of re.sub(), which simply discards the matched text.

You can split at the last space and at the last dot in the string, as follows (splitting on a zero-width match like this requires Python 3.7 or later):

In [58]: se, dates = [], []

In [59]: for x in file_name_list:
    ...:     a, date, c = re.split(r'(?=(?:(?:\.[^.]*| [^ ]*))$)', x)
    ...:     se.append(a + c)
    ...:     dates.append(date.strip())

In [60]: se
Out[60]: 
['ABDCD Pattern Raw Data.xlsx',
 'Jack Raw Data.xlsx',
 'Farmers holdings.xlsx',
 'Anne Raw Data.csv',
 '120 Brewers.txt']

In [61]: dates
Out[61]: ['1.4.2016', '1.2.2016', '1.1.2016', '1.3.2016', '5-2-2018']

And if you just want to remove the dates:

In [65]: [re.sub(r' (?:\d+[.-]){2}\d+','', x) for x in file_name_list]
Out[65]: 
['ABDCD Pattern Raw Data.xlsx',
 'Jack Raw Data.xlsx',
 'Farmers holdings.xlsx',
 'Anne Raw Data.csv',
 '120 Brewers.txt']
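
To get the lowercased names shown in the question, the same substitution can be combined with str.lower(). This is only a minimal sketch: it assumes each date is preceded by a single space and is made of three digit groups joined by dots or hyphens, and the names DATE_RE and strip_date are made up for illustration.

```python
import re

# Assumed date shape: a space, then three digit groups joined by '.' or '-'
DATE_RE = re.compile(r' (?:\d+[.-]){2}\d+')

def strip_date(name):
    """Remove the first date-like token and lowercase the result."""
    return DATE_RE.sub('', name, count=1).lower()

file_name_list = ['ABDCD Pattern Raw Data 1.4.2016.xlsx',
                  'Jack Raw Data 1.2.2016.xlsx',
                  'Farmers holdings 1.1.2016.xlsx',
                  'Anne Raw Data 1.3.2016.csv',
                  '120 Brewers 5-2-2018.txt']

output_list = [strip_date(name) for name in file_name_list]
print(output_list)
```

Because the pattern requires a leading space and two separators, the 120 in 120 Brewers is left alone.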

Upvotes: 1

d g

Reputation: 1604

import re

file_name_list = ['ABDCD Pattern Raw Data 1.4.2016.xlsx',
   'Jack Raw Data 1.2.2016.xlsx',
   'Farmers holdings 1.1.2016.xlsx',
   'Anne Raw Data 1.3.2016.csv',
   '120 Brewers 5-2-2018.txt']

for file in file_name_list:
    # Raw string so that \s and \d reach the regex engine unchanged
    replaced = re.sub(r'\s\d{1,2}[\.-]\d{1,2}[\.-]\d{4}', '', file)
    print(replaced)

Output:

ABDCD Pattern Raw Data.xlsx
Jack Raw Data.xlsx
Farmers holdings.xlsx
Anne Raw Data.csv
120 Brewers.txt

Upvotes: 1

Zack

Reputation: 121

The replace also took your dots, so the files no longer have an extension. I see that the dates are in multiple formats, which isn't helpful. What you need to do is examine your data (the filenames) and find a pattern that consistently picks out the dates, and only the dates, from the rest of the filename.

From what you provided, it looks like a couple of splits may be in order. I'd split at the last dot first to separate the extension (the dates themselves contain dots), and then split the rest by the space character. From the list of file name pieces, .pop the last item (the date) and .join the remainder of the list back together. Append your extension and you'll be good. This assumes that you have no filename dates of the format "abc xyz mm dd yyyy.ext".
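
The split-based idea above could be sketched roughly as follows. It uses os.path.splitext and str.rpartition rather than literal .pop and .join calls, and it cuts the extension at the last dot because the dates contain dots of their own. The function name drop_trailing_date is hypothetical, and the sketch assumes every name ends with a space-separated date just before the extension; a name without a date would lose its last word.

```python
import os

def drop_trailing_date(name):
    # Split off the extension at the last dot,
    # e.g. ('Jack Raw Data 1.2.2016', '.xlsx')
    stem, ext = os.path.splitext(name)
    # Split the stem once at the last space; the final piece is assumed to be the date
    head, sep, _date = stem.rpartition(' ')
    if not sep:
        # No space at all: nothing that looks like a trailing date, leave the name alone
        return name
    return head + ext

file_name_list = ['ABDCD Pattern Raw Data 1.4.2016.xlsx',
                  'Jack Raw Data 1.2.2016.xlsx',
                  'Farmers holdings 1.1.2016.xlsx',
                  'Anne Raw Data 1.3.2016.csv',
                  '120 Brewers 5-2-2018.txt']

cleaned = [drop_trailing_date(name) for name in file_name_list]
print(cleaned)
```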

Upvotes: 0

Matt.G

Reputation: 3609

Regex:

\s\d{1,2}(\.|\-)\d{1,2}\1\d{4}
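
A short sketch of how this pattern might be used with re.sub in Python. The backreference \1 forces the second separator to be the same character as the first. The third sample name, with mixed separators, is invented here to show that case and does not come from the question.

```python
import re

# \1 repeats whatever separator group 1 matched, so '1.3.2016' and
# '5-2-2018' match, but a mixed '1.4-2016' does not
pattern = re.compile(r'\s\d{1,2}(\.|\-)\d{1,2}\1\d{4}')

names = ['Anne Raw Data 1.3.2016.csv',
         '120 Brewers 5-2-2018.txt',
         'Mixed 1.4-2016.txt']  # hypothetical name with mixed separators

cleaned = [pattern.sub('', n) for n in names]
print(cleaned)
```

Whether rejecting mixed separators is desirable depends on the data. The character-class version [.-] in the other answers accepts them.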


Upvotes: 3
