rgh_dsa

Reputation: 117

Python - Pandas Concatenate Multiple Text Files Within Multiple Zip Files

I am having trouble loading and concatenating txt files that live inside zip files with pandas. There are many examples on here using pd.concat(zip_file.open), but I still can't get anything to work in my case, since I have more than one zip file and multiple txt files in each.

For example, let's say I have TWO zip files in a specific folder "Main". Each zip file contains FIVE txt files. I want to read all of these txt files and pd.concat them all together. In my real-world example I will have dozens of zip files, each containing five txt files.

Can you help please?

Folder and File Structure for Example:

'C:/User/Example/Main'   
   TAG_001.zip
     sample001_1.txt
     sample001_2.txt
     sample001_3.txt
     sample001_4.txt
     sample001_5.txt
   TAG_002.zip
     sample002_1.txt
     sample002_2.txt
     sample002_3.txt
     sample002_4.txt
     sample002_5.txt

I started like this, but everything I try after this throws errors:

import os
import glob
import pandas as pd
import zipfile

path = 'C:/User/Example/Main'

ziplist = glob.glob(os.path.join(path, "*TAG*.zip"))

Upvotes: 3

Views: 1048

Answers (1)

John

Reputation: 13699

This isn't efficient but it should give you some idea of how it might be done.

import os
import zipfile

import pandas as pd

frames = {}

BASE_DIR = 'C:/User/Example/Main'
# os.walk yields (dirpath, dirnames, filenames); take the top-level filenames
_, _, zip_filenames = next(os.walk(BASE_DIR))
for zip_filename in zip_filenames:
    with zipfile.ZipFile(os.path.join(BASE_DIR, zip_filename)) as zip_:
        for filename in zip_.namelist():
            with zip_.open(filename) as file_:
                new_frame = pd.read_csv(file_, sep='\t')
                frame = frames.get(filename)
                if frame is not None:
                    # store the concatenated result; pd.concat returns a new
                    # frame rather than modifying either argument in place
                    frames[filename] = pd.concat([frame, new_frame])
                else:
                    frames[filename] = new_frame

# once all frames have been concatenated, loop over the dict and write them back out
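For example, a minimal sketch of that write-out step, assuming you want one CSV per distinct txt filename (the output directory here is a made-up placeholder, not part of the original answer):

# hypothetical write-out step: one CSV per distinct txt filename
OUT_DIR = 'C:/User/Example/Out'  # assumed output location, adjust as needed
os.makedirs(OUT_DIR, exist_ok=True)
for filename, frame in frames.items():
    out_name = os.path.splitext(os.path.basename(filename))[0] + '.csv'
    frame.to_csv(os.path.join(OUT_DIR, out_name), index=False)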

Depending on how much data there is, you will have to design a solution that balances processing power, memory, and disk space. This solution could potentially use up a lot of memory, since every frame stays in the dict until the end.
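If you really just want one combined frame, as described in the question, a more memory- and CPU-friendly pattern is to collect every frame in a list and call pd.concat once at the end instead of concatenating pair by pair inside the loop. A sketch, assuming the txt files are tab-separated with compatible columns (it reuses BASE_DIR and the imports from above, plus the glob pattern from the question):

import glob

all_frames = []
for zip_path in glob.glob(os.path.join(BASE_DIR, '*TAG*.zip')):
    with zipfile.ZipFile(zip_path) as zip_:
        for member in zip_.namelist():
            with zip_.open(member) as file_:
                all_frames.append(pd.read_csv(file_, sep='\t'))

# one concat at the end avoids repeatedly copying the accumulated data
combined = pd.concat(all_frames, ignore_index=True)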

Upvotes: 1
