Python - Extract CSV Files from Multiple Zip Files and Combine the Data

Question

I have a Python script and pandas to combine multiple ZIP files. I am using data hosted in a GitHub repository here: https://github.com/statistikat/coronaDAT

The script should take all ZIP files in a folder structure, find the "Bezirke.csv" file in the ZIP file, and combine all the Bezirke.csv files into one large CSV file. However, the code is only grabbing one ZIP file from the folder.

Any suggestions on why the it is not taking the data from the other ZIP files in the folder?

import glob
from zipfile import ZipFile

path = r'/Users/matt/test/' # use your path

#load all zip files in folder
all_files = glob.glob(path + "/*.zip")

li = []

for filename in all_files:

    zip_file = ZipFile(filename)
    df = {text_file.filename: pd.read_csv(zip_file.open(text_file.filename), delimiter=';', header=0, index_col=['Timestamp'], parse_dates=['Timestamp'])
       for text_file in zip_file.infolist()
       if text_file.filename.endswith('Bezirke.csv')}

    li.append(df)

#print dataframe in console
print(df)

#prepare date to export to csv
frame = pd.concat(df, axis=0)

#export to csv
frame.to_csv( "combined_zip_Bezirke.csv", encoding='utf-8-sig')
print("Export to CSV Successful")```

Anshul · Accepted Answer

I made some changes to your code and tested it with three sample files:

20200523_010000_orig_csv.zip
20200523_020000_orig_csv.zip
20200523_030000_orig_csv.zip

Modified code:

df_master = pd.DataFrame()
flag = False

for filename in all_files:    
    zip_file = ZipFile(filename)
    for text_file in zip_file.infolist():
        if text_file.filename.endswith('Bezirke.csv'):
            df = pd.read_csv(zip_file.open(text_file.filename), 
            delimiter=';', 
            header=0, 
            index_col=['Timestamp'], 
            parse_dates=['Timestamp']
            )
    if not flag:
        df_master = df
        flag = True
    else:
        df_master = pd.concat([df_master, df])

print(df_master.info())

Ouput:


DatetimeIndex: 279 entries, 2020-05-23 02:00:00 to 2020-05-23 03:00:00
Data columns (total 4 columns):
Bezirk             279 non-null object
Anzahl             279 non-null int64
Anzahl_Inzidenz    279 non-null object
GKZ                279 non-null int64

The files data is getting appended now. Not sure if this is the best solution in terms of optimisation. Kindly suggest changes for improvement, if applicable.

Dataframe Output:

                                           Bezirk  Anzahl   Anzahl_Inzidenz  GKZ
Timestamp                                                                       
2020-05-23 02:00:00                     Amstetten     301  259,228000068898  305
2020-05-23 02:00:00                         Baden     157  107,384937381586  306
2020-05-23 02:00:00                       Bludenz     256  401,795523746743  801
2020-05-23 02:00:00                Braunau am Inn     130  124,511531683396  404
2020-05-23 02:00:00                       Bregenz     274  203,894837888721  802
2020-05-23 02:00:00           Bruck an der Leitha     105  102,931085187727  307
2020-05-23 02:00:00            Bruck-Mürzzuschlag      52  52,5337428271236  621
2020-05-23 02:00:00              Deutschlandsberg      49   80,564278785288  603
2020-05-23 02:00:00                      Dornbirn     153  171,830954279489  803
2020-05-23 02:00:00                      Eferding      53  159,850404150078  405
2020-05-23 02:00:00             Eisenstadt(Stadt)      18  122,976019676163  101
2020-05-23 02:00:00           Eisenstadt-Umgebung      23  53,5793323549281  103
2020-05-23 02:00:00                     Feldkirch     213  198,770051978835  804
2020-05-23 02:00:00                   Feldkirchen      15  50,1052209640244  210
2020-05-23 02:00:00                     Freistadt     120  180,123384518395  406
2020-05-23 02:00:00                   Gänserndorf      99  95,4805856142584  308
2020-05-23 02:00:00                         Gmünd      14  38,0714110896582  309
2020-05-23 02:00:00                       Gmunden      73  71,8284775314619  407
2020-05-23 02:00:00                   Graz(Stadt)     466  161,353988490544  601
2020-05-23 02:00:00                 Graz-Umgebung     246  159,471022948269  606
2020-05-23 02:00:00                  Grieskirchen      74  114,336923100694  408
2020-05-23 02:00:00                       Güssing      24  93,0340737295034  104
2020-05-23 02:00:00                       Hallein      69  114,287607248153  502
2020-05-23 02:00:00          Hartberg-Fürstenfeld     318  350,908167994527  622
2020-05-23 02:00:00                      Hermagor       5  27,4363476733977  203
2020-05-23 02:00:00                    Hollabrunn      37  72,7515828384915  310
2020-05-23 02:00:00                          Horn      58   186,55516243165  311
2020-05-23 02:00:00                          Imst     293  487,877980551485  702
2020-05-23 02:00:00                Innsbruck-Land     388  216,375377820408  703
2020-05-23 02:00:00               Innsbruck-Stadt     410  310,347437741276  701
...                                           ...     ...               ...  ...
2020-05-23 03:00:00        Sankt Johann im Pongau     434  538,641976840877  504
2020-05-23 03:00:00            Sankt Pölten(Land)     267  203,748359329691  319
2020-05-23 03:00:00           Sankt Pölten(Stadt)      90  163,505559189012  302
2020-05-23 03:00:00        Sankt Veit an der Glan      52  95,3166529190725  205
2020-05-23 03:00:00                     Schärding      58  101,209276353674  414
2020-05-23 03:00:00                      Scheibbs     103  248,774243412313  320
2020-05-23 03:00:00                        Schwaz     352   419,68213847126  709
2020-05-23 03:00:00           Spittal an der Drau      42  55,1970666701712  206
2020-05-23 03:00:00                  Steyr(Stadt)      62  162,333411881758  402
2020-05-23 03:00:00                    Steyr-Land     181  299,534976086849  415
2020-05-23 03:00:00              Südoststeiermark      62  72,1374800749299  623
2020-05-23 03:00:00                       Tamsweg      22  108,267716535433  505
2020-05-23 03:00:00                         Tulln     175  168,640564319511  321
2020-05-23 03:00:00               Urfahr-Umgebung     281  328,635752295187  416
2020-05-23 03:00:00                  Villach Land      48  74,2252737056968  207
2020-05-23 03:00:00                 Villach Stadt      21  33,7387336728628  202
2020-05-23 03:00:00                   Vöcklabruck     107  78,5303809824371  417
2020-05-23 03:00:00                     Voitsberg     106  207,189069799261  616
2020-05-23 03:00:00                   Völkermarkt      52  124,170208701466  208
2020-05-23 03:00:00        Waidhofen an der Thaya      73  281,983930778739  322
2020-05-23 03:00:00  Waidhofen an der Ybbs(Stadt)      25  222,005150519492  303
2020-05-23 03:00:00                          Weiz     197  218,057846208339  617
2020-05-23 03:00:00                   Wels(Stadt)      58  93,9621235439921  403
2020-05-23 03:00:00                     Wels-Land      79  108,080006566886  418
2020-05-23 03:00:00                   Wien(Stadt)    3048  160,633172963666  900
2020-05-23 03:00:00         Wiener Neustadt(Land)      86  110,269133617982  323
2020-05-23 03:00:00        Wiener Neustadt(Stadt)      56  123,683106212867  304
2020-05-23 03:00:00                     Wolfsberg      43  81,5536926753404  209
2020-05-23 03:00:00                   Zell am See     359  410,463972925385  506
2020-05-23 03:00:00                        Zwettl      73  172,895646819194  325

[279 rows x 4 columns]

Finally, you can write the output to the csv file. I am not sure if you want to append the file name to the output csv file as your current code does because its going to the same name 'Bezirke.csv' everytime.

Python - Extract CSV Files from Multiple Zip Files and Combine the Data

Answers (1)

Related Questions