Reputation: 22041
I have the following Python list:
['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv', 'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv', 'daman_and_diu_2002_aa.csv']
How do I separate it into 2 lists:
['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv'] and ['daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv', 'daman_and_diu_2002_aa.csv']
The lists are split based on the words preceding the year, e.g. 2000.
I know I should use regex in Python, but I'm not sure how to do it. Also, the solution needs to be extensible and not dependent on the actual names, e.g. chhattisgarh.
Upvotes: 2
Views: 447
Reputation: 22564
Here is one way to get a dictionary where each "name" key maps to a list of the strings starting with that name, keeping the order of the original list. This does not use regex and in fact uses no modules at all. You can easily modify this to make a function, remove the trailing underscore from each name, check for various errors in the data list, get the resulting lists out of the dictionary, and so on.
If you allow other modules, or allow changes in the order, I'm sure there are other ways.
a = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
     'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
     'daman_and_diu_2002_aa.csv']
names_dict = {}
for item in a:
    # Find the first numeric character in the item
    for i, c in enumerate(item):
        if c.isdigit():
            break
    # Store the string in the dictionary according to its preceding characters
    name = item[:i]
    if names_dict.get(name, None):
        names_dict[name].append(item)
    else:
        names_dict[name] = [item]
print(names_dict)
The result of this code (prettified) is
{'daman_and_diu_': [
'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
'daman_and_diu_2002_aa.csv'],
'chhattisgarh_': [
'chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv']
}
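The modifications mentioned above (wrapping the loop in a function, dropping the trailing underscore, and pulling the lists out of the dictionary) could look roughly like this. It is only a sketch: the name group_by_prefix is made up for illustration, and it assumes that an item with no digit at all should be grouped under its full text.

```python
def group_by_prefix(items):
    """Group strings by the text before their first digit (hypothetical helper)."""
    groups = {}
    for item in items:
        # Index of the first digit, or len(item) when the item has no digit
        cut = next((i for i, c in enumerate(item) if c.isdigit()), len(item))
        # Drop the trailing underscore so the key is just the name
        name = item[:cut].rstrip('_')
        groups.setdefault(name, []).append(item)
    return groups


a = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
     'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
     'daman_and_diu_2002_aa.csv']
groups = group_by_prefix(a)
# The separate lists asked for in the question, one per name
lists = list(groups.values())
print(lists)
```

On Python 3.7 and later the dictionary keeps insertion order, so the lists come out in the order in which each name first appears in the input.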
Upvotes: 4
Reputation: 215137
Another option is to use a regular expression combined with a dictionary:
files = ["chhattisgarh_2015_aa.csv", "chhattisgarh_2016_aa.csv", "daman_and_diu_2000_aa.csv", "daman_and_diu_2001_aa.csv", "daman_and_diu_2002_aa.csv"]
import re
from collections import defaultdict
groupedFiles = defaultdict(list)
for fileName in files:
    pattern = re.findall("(.*)\\d{4}", fileName)[0]
    groupedFiles[pattern].append(fileName)
groupedFiles
{'chhattisgarh_': ['chhattisgarh_2015_aa.csv',
'chhattisgarh_2016_aa.csv'],
'daman_and_diu_': ['daman_and_diu_2000_aa.csv',
'daman_and_diu_2001_aa.csv',
'daman_and_diu_2002_aa.csv']}
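One thing to watch for: indexing the findall result with [0] raises an IndexError for any file name that contains no four-digit run. A possible variant that collects such names separately is sketched below. The helper name group_by_name, the YEAR_RE constant, and the extra notes.txt entry are all assumptions added for illustration, and the pattern assumes the name is always followed by an underscore and a four-digit year.

```python
import re
from collections import defaultdict

# Assumed convention: <name>_<four-digit year>...; the lazy group stops at the first year
YEAR_RE = re.compile(r"(.+?)_\d{4}")


def group_by_name(file_names):
    """Return (groups, unmatched) for the given file names (hypothetical helper)."""
    grouped = defaultdict(list)
    unmatched = []
    for file_name in file_names:
        match = YEAR_RE.match(file_name)
        if match is None:
            # No "<name>_<year>" prefix, so keep the name aside instead of failing
            unmatched.append(file_name)
        else:
            grouped[match.group(1)].append(file_name)
    return dict(grouped), unmatched


files = ["chhattisgarh_2015_aa.csv", "chhattisgarh_2016_aa.csv",
         "daman_and_diu_2000_aa.csv", "notes.txt"]
grouped, unmatched = group_by_name(files)
print(grouped)
print(unmatched)
```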
Upvotes: 2
Reputation: 103585
You can use itertools.groupby here:
import itertools
import re
# Named "files" rather than "list" to avoid shadowing the built-in
files = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
         'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
         'daman_and_diu_2002_aa.csv']
# groupby only merges adjacent items, so sort first; the raw string keeps \d intact
grouped = itertools.groupby(sorted(files), lambda x: re.match(r'(.+)_\d{4}', x).group(1))
for (key, values) in grouped:
    print(key)
    print(list(values))
The regex (.+)_\d{4} matches a group of at least one character (which is what we group by) followed by an underscore and 4 digits.
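The sorting step matters because itertools.groupby only merges items that sit next to each other. The sketch below shows the difference on a deliberately shuffled list; the name_key helper and the sample data are made up for illustration, and sorting by the same key function is one way to guarantee that equal keys end up adjacent.

```python
import itertools
import re


def name_key(file_name):
    """Return the part of the file name before '_<four digits>' (hypothetical helper)."""
    return re.match(r'(.+)_\d{4}', file_name).group(1)


unsorted_files = ['daman_and_diu_2000_aa.csv', 'chhattisgarh_2015_aa.csv',
                  'daman_and_diu_2001_aa.csv']

# Without sorting, the same key shows up as two separate groups
without_sort = [key for key, _ in itertools.groupby(unsorted_files, name_key)]

# Sorting by the grouping key first puts equal keys next to each other
with_sort = [key for key, _ in
             itertools.groupby(sorted(unsorted_files, key=name_key), name_key)]

print(without_sort)  # ['daman_and_diu', 'chhattisgarh', 'daman_and_diu']
print(with_sort)     # ['chhattisgarh', 'daman_and_diu']
```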
Upvotes: 5