Reputation: 58
I want to add all the data from charts.zip from https://doi.org/10.5281/zenodo.4778562 in a single DataFrame. The data consist of a file per year that contains multiple CSVs. I made the following code:
header = 0
dfs = []
for file in glob.glob('Charts/*/201?/*.csv'):
region = file.split('/')[1]
dates = re.findall('\d{4}-\d{2}-\d{2}', file.split('/')[-1])
weekly_chart = pd.read_csv(file, header=header, sep='\t')
weekly_chart['week_start'] = datetime.strptime(dates[0], '%Y-%m-%d')
weekly_chart['week_end'] = datetime.strptime(dates[1], '%Y-%m-%d')
weekly_chart['region'] = region
dfs.append(weekly_chart)
all_charts = pd.concat(dfs)
But, when I run it, python returns:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_12886/3473678833.py in <module>
9 weekly_chart['region'] = region
10 dfs.append(weekly_chart)
---> 11 all_charts = pd.concat(dfs)
~/Downloads/enter/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
344 ValueError: Indexes have overlapping values: ['a']
345 """
--> 346 op = _Concatenator(
347 objs,
348 axis=axis,
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
401
402 if len(objs) == 0:
--> 403 raise ValueError("No objects to concatenate")
404
405 if keys is None:
ValueError: No objects to concatenate
How can I fix it?
Upvotes: -1
Views: 87
Reputation: 13241
I think the glob.glob
might just be over complicating things... This works perfectly for me.
# Gives you a list of EVERY file in the Charts directory
# and sub directories that is a CSV
file_list = []
for path, subdirs, files in os.walk("Charts"):
file_list.extend([os.path.join(path, x) for x in files if x.endswith('.csv')])
dfs = []
for file in file_list:
region = file.split('/')[1]
dates = re.findall('\d{4}-\d{2}-\d{2}', file.split('/')[-1])
df = pd.read_csv(file, sep='\t')
df['week_start'] = dates[0]
df['week_end'] = dates[1]
df['region'] = region
dfs.append(df)
all_charts = pd.concat(dfs, ignore_index=True)
print(all_charts)
Output:
position song_id song_name artist streams ... peak_position position_status week_start week_end region
0 1 7wGoVu4Dady5GV0Sv4UIsx rockstar Post Malone 17532665 ... 1 0 2017-10-20 2017-10-27 us
1 2 75ZvA4QfFiZvzhj2xkaWAh I Fall Apart Post Malone 8350785 ... 2 0 2017-10-20 2017-10-27 us
2 3 2fQrGHiQOvpL9UgPvtYy6G Bank Account 21 Savage 7589124 ... 3 1 2017-10-20 2017-10-27 us
3 4 43ZyHQITOjhciSUUNPVRHc Gucci Gang Lil Pump 7584237 ... 4 1 2017-10-20 2017-10-27 us
4 5 5tz69p7tJuGPeMGwNTxYuV 1-800-273-8255 Logic 7527770 ... 1 -2 2017-10-20 2017-10-27 us
... ... ... ... ... ... ... ... ... ... ... ...
273595 196 6kex4EBAj0WHXDKZMEJaaF Swalla (feat. Nicki Minaj & Ty Dolla $ign) Jason Derulo 3747830 ... 8 -5 2018-03-02 2018-03-09 global
273596 197 0CokSRCu5hZgPxcZBaEzVE Glorious (feat. Skylar Grey) Macklemore 3725286 ... 14 -8 2018-03-02 2018-03-09 global
273597 198 7oK9VyNzrYvRFo7nQEYkWN Mr. Brightside The Killers 3717326 ... 148 -3 2018-03-02 2018-03-09 global
273598 199 7EUfNvyCVxQV3oN5ScA2Lb Next To Me Imagine Dragons 3681739 ... 122 -77 2018-03-02 2018-03-09 global
273599 200 6u0EAxf1OJTLS7CvInuNd7 Vai malandra (feat. Tropkillaz & DJ Yuri Martins) Anitta 3676542 ... 30 -23 2018-03-02 2018-03-09 global
If you really want the dates to be dates, you can run this on those two columns at the end.
all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
Personally, I'd also do the following:
all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
all_charts['week_end'] = pd.to_datetime(all_charts['week_end'])
all_charts['region'] = all_charts['region'].astype('category')
all_charts['artist'] = all_charts['artist'].astype('category')
all_charts['song_name'] = all_charts['song_name'].astype('category')
all_charts['song_id'] = all_charts['song_id'].astype('category')
all_charts.set_index(['region', 'week_start', 'week_end', 'position'], inplace=True)
all_charts.position_status = pd.to_numeric(all_charts.position_status, errors='coerce')
print(df.head(10))
Giving:
song_id song_name artist streams last_week_position weeks_on_chart peak_position position_status
region week_start week_end position
us 2017-10-20 2017-10-27 1 7wGoVu4Dady5GV0Sv4UIsx rockstar Post Malone 17532665 1.0 3 1 0.0
2 75ZvA4QfFiZvzhj2xkaWAh I Fall Apart Post Malone 8350785 2.0 6 2 0.0
3 2fQrGHiQOvpL9UgPvtYy6G Bank Account 21 Savage 7589124 4.0 5 3 1.0
4 43ZyHQITOjhciSUUNPVRHc Gucci Gang Lil Pump 7584237 5.0 3 4 1.0
5 5tz69p7tJuGPeMGwNTxYuV 1-800-273-8255 Logic 7527770 3.0 26 1 -2.0
6 5Gd19NupVe5X8bAqxf9Iaz Gorgeous Taylor Swift 6940802 NaN 1 6 NaN
7 0ofbQMrRDsUaVKq2mGLEAb Havana Camila Cabello 6623184 10.0 12 7 3.0
8 2771LMNxwf62FTAdpJMQfM Bodak Yellow Cardi B 6472727 6.0 14 3 -2.0
9 5Z3GHaZ6ec9bsiI5BenrbY Young Dumb & Broke Khalid 5982108 9.0 29 6 0.0
10 7GX5flRQZVHRAGd6B4TmDO XO Tour Llif3 Lil Uzi Vert 5822583 8.0 9 2 -2.0
Upvotes: 0