Reputation: 97
I have a Python script and pandas to combine multiple ZIP files. I am using data hosted in a GitHub repository here: https://github.com/statistikat/coronaDAT
The script should take all ZIP files in a folder structure, find the "Bezirke.csv" file in the ZIP file, and combine all the Bezirke.csv files into one large CSV file. However, the code is only grabbing one ZIP file from the folder.
Any suggestions on why the it is not taking the data from the other ZIP files in the folder?
import glob
from zipfile import ZipFile
path = r'/Users/matt/test/' # use your path
#load all zip files in folder
all_files = glob.glob(path + "/*.zip")
li = []
for filename in all_files:
zip_file = ZipFile(filename)
df = {text_file.filename: pd.read_csv(zip_file.open(text_file.filename), delimiter=';', header=0, index_col=['Timestamp'], parse_dates=['Timestamp'])
for text_file in zip_file.infolist()
if text_file.filename.endswith('Bezirke.csv')}
li.append(df)
#print dataframe in console
print(df)
#prepare date to export to csv
frame = pd.concat(df, axis=0)
#export to csv
frame.to_csv( "combined_zip_Bezirke.csv", encoding='utf-8-sig')
print("Export to CSV Successful")```
Upvotes: 3
Views: 3789
Reputation: 1413
I made some changes to your code and tested it with three sample files:
20200523_010000_orig_csv.zip
20200523_020000_orig_csv.zip
20200523_030000_orig_csv.zip
Modified code:
df_master = pd.DataFrame()
flag = False
for filename in all_files:
zip_file = ZipFile(filename)
for text_file in zip_file.infolist():
if text_file.filename.endswith('Bezirke.csv'):
df = pd.read_csv(zip_file.open(text_file.filename),
delimiter=';',
header=0,
index_col=['Timestamp'],
parse_dates=['Timestamp']
)
if not flag:
df_master = df
flag = True
else:
df_master = pd.concat([df_master, df])
print(df_master.info())
Ouput:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 279 entries, 2020-05-23 02:00:00 to 2020-05-23 03:00:00
Data columns (total 4 columns):
Bezirk 279 non-null object
Anzahl 279 non-null int64
Anzahl_Inzidenz 279 non-null object
GKZ 279 non-null int64
The files data is getting appended now. Not sure if this is the best solution in terms of optimisation. Kindly suggest changes for improvement, if applicable.
Dataframe Output:
Bezirk Anzahl Anzahl_Inzidenz GKZ
Timestamp
2020-05-23 02:00:00 Amstetten 301 259,228000068898 305
2020-05-23 02:00:00 Baden 157 107,384937381586 306
2020-05-23 02:00:00 Bludenz 256 401,795523746743 801
2020-05-23 02:00:00 Braunau am Inn 130 124,511531683396 404
2020-05-23 02:00:00 Bregenz 274 203,894837888721 802
2020-05-23 02:00:00 Bruck an der Leitha 105 102,931085187727 307
2020-05-23 02:00:00 Bruck-Mürzzuschlag 52 52,5337428271236 621
2020-05-23 02:00:00 Deutschlandsberg 49 80,564278785288 603
2020-05-23 02:00:00 Dornbirn 153 171,830954279489 803
2020-05-23 02:00:00 Eferding 53 159,850404150078 405
2020-05-23 02:00:00 Eisenstadt(Stadt) 18 122,976019676163 101
2020-05-23 02:00:00 Eisenstadt-Umgebung 23 53,5793323549281 103
2020-05-23 02:00:00 Feldkirch 213 198,770051978835 804
2020-05-23 02:00:00 Feldkirchen 15 50,1052209640244 210
2020-05-23 02:00:00 Freistadt 120 180,123384518395 406
2020-05-23 02:00:00 Gänserndorf 99 95,4805856142584 308
2020-05-23 02:00:00 Gmünd 14 38,0714110896582 309
2020-05-23 02:00:00 Gmunden 73 71,8284775314619 407
2020-05-23 02:00:00 Graz(Stadt) 466 161,353988490544 601
2020-05-23 02:00:00 Graz-Umgebung 246 159,471022948269 606
2020-05-23 02:00:00 Grieskirchen 74 114,336923100694 408
2020-05-23 02:00:00 Güssing 24 93,0340737295034 104
2020-05-23 02:00:00 Hallein 69 114,287607248153 502
2020-05-23 02:00:00 Hartberg-Fürstenfeld 318 350,908167994527 622
2020-05-23 02:00:00 Hermagor 5 27,4363476733977 203
2020-05-23 02:00:00 Hollabrunn 37 72,7515828384915 310
2020-05-23 02:00:00 Horn 58 186,55516243165 311
2020-05-23 02:00:00 Imst 293 487,877980551485 702
2020-05-23 02:00:00 Innsbruck-Land 388 216,375377820408 703
2020-05-23 02:00:00 Innsbruck-Stadt 410 310,347437741276 701
... ... ... ... ...
2020-05-23 03:00:00 Sankt Johann im Pongau 434 538,641976840877 504
2020-05-23 03:00:00 Sankt Pölten(Land) 267 203,748359329691 319
2020-05-23 03:00:00 Sankt Pölten(Stadt) 90 163,505559189012 302
2020-05-23 03:00:00 Sankt Veit an der Glan 52 95,3166529190725 205
2020-05-23 03:00:00 Schärding 58 101,209276353674 414
2020-05-23 03:00:00 Scheibbs 103 248,774243412313 320
2020-05-23 03:00:00 Schwaz 352 419,68213847126 709
2020-05-23 03:00:00 Spittal an der Drau 42 55,1970666701712 206
2020-05-23 03:00:00 Steyr(Stadt) 62 162,333411881758 402
2020-05-23 03:00:00 Steyr-Land 181 299,534976086849 415
2020-05-23 03:00:00 Südoststeiermark 62 72,1374800749299 623
2020-05-23 03:00:00 Tamsweg 22 108,267716535433 505
2020-05-23 03:00:00 Tulln 175 168,640564319511 321
2020-05-23 03:00:00 Urfahr-Umgebung 281 328,635752295187 416
2020-05-23 03:00:00 Villach Land 48 74,2252737056968 207
2020-05-23 03:00:00 Villach Stadt 21 33,7387336728628 202
2020-05-23 03:00:00 Vöcklabruck 107 78,5303809824371 417
2020-05-23 03:00:00 Voitsberg 106 207,189069799261 616
2020-05-23 03:00:00 Völkermarkt 52 124,170208701466 208
2020-05-23 03:00:00 Waidhofen an der Thaya 73 281,983930778739 322
2020-05-23 03:00:00 Waidhofen an der Ybbs(Stadt) 25 222,005150519492 303
2020-05-23 03:00:00 Weiz 197 218,057846208339 617
2020-05-23 03:00:00 Wels(Stadt) 58 93,9621235439921 403
2020-05-23 03:00:00 Wels-Land 79 108,080006566886 418
2020-05-23 03:00:00 Wien(Stadt) 3048 160,633172963666 900
2020-05-23 03:00:00 Wiener Neustadt(Land) 86 110,269133617982 323
2020-05-23 03:00:00 Wiener Neustadt(Stadt) 56 123,683106212867 304
2020-05-23 03:00:00 Wolfsberg 43 81,5536926753404 209
2020-05-23 03:00:00 Zell am See 359 410,463972925385 506
2020-05-23 03:00:00 Zwettl 73 172,895646819194 325
[279 rows x 4 columns]
Finally, you can write the output to the csv file. I am not sure if you want to append the file name to the output csv file as your current code does because its going to the same name 'Bezirke.csv' everytime.
Upvotes: 2