flyingtoaster

Reputation: 17

Issues opening and reading txt files in a zip folder in Pandas

My current code opens only one .txt file in a zip archive that contains four. I want to read all of those .txt files into a CSV, but I am unsure why it is not reading all of them. I assume the cause is zip_csv_file = zip_csv_files[0], but I am unsure how to modify my code to parse every .txt file in a given zip archive.

Code:

def custom_parse(self, response):
        self.logger.info(response.url)
        links = response.xpath("//a[contains(@href, '.zip')]/@href").getall()
        for link in list(set(links)):
            print(link)
            local_path = self.download_file("https://www.sec.gov" + link)
            zip_file = zipfile.ZipFile(local_path)
            zip_csv_files = [file_name for file_name in zip_file.namelist() if file_name.endswith(".txt") and "pre" not in file_name]
            zip_csv_file = zip_csv_files[0]
            with zip_file.open(zip_csv_file, "r") as zip:
                df = pd.read_csv(BytesIO(zip.read()), dtype=str, sep='\t')

            df = self.standardized(df)
            for k, row in df.iterrows():
                yield dict(row)

Edit:

with zip_file.open(zip_csv_file, "r") as zip:
UnboundLocalError: local variable 'zip_csv_file' referenced before assignment

Upvotes: 0

Views: 98

Answers (1)

heretolearn

Reputation: 6545

You can try this if you want the data from all the files in a single data frame:

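import zipfile
import pandas as pd

path = "<path to the zip file>"  # placeholder: replace with the real path
# Open the archive once instead of once per member.
with zipfile.ZipFile(path) as zip_file:
    df = pd.concat(
        # sep and dtype mirror the read_csv call in the question
        [pd.read_csv(zip_file.open(text_file), sep='\t', dtype=str)
         for text_file in zip_file.infolist()
         if text_file.filename.endswith('.txt') and "pre" not in text_file.filename],
        ignore_index=True
    )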

If you want each file as a separate data frame:

path = "<path to the zip file>"  # placeholder: replace with the real path
with zipfile.ZipFile(path) as zip_file:
    # sep and dtype mirror the read_csv call in the question
    dfs = {text_file.filename: pd.read_csv(zip_file.open(text_file), sep='\t', dtype=str)
           for text_file in zip_file.infolist()
           if text_file.filename.endswith('.txt') and "pre" not in text_file.filename}

This gives you a dict of data frames keyed by filename.
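If you would rather keep the loop from your question, here is a minimal sketch of iterating over every matching member instead of taking zip_csv_files[0]. The helper name iter_txt_frames and the sample file names and contents are made up for illustration; the sep and dtype arguments are copied from your read_csv call. It builds a small zip in memory so it runs without downloading anything.

```python
import io
import zipfile

import pandas as pd


def iter_txt_frames(zip_source):
    """Yield (filename, DataFrame) for every .txt member whose name lacks "pre".

    iter_txt_frames is a hypothetical helper name; zip_source may be a
    path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as zip_file:
        for name in zip_file.namelist():
            if name.endswith(".txt") and "pre" not in name:
                with zip_file.open(name) as member:
                    # sep and dtype mirror the read_csv call in the question
                    yield name, pd.read_csv(member, sep="\t", dtype=str)


# Made-up sample archive: two files to keep and one "pre" file to skip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("num.txt", "adsh\tvalue\n0001\t10\n")
    archive.writestr("sub.txt", "adsh\tname\n0001\tACME\n")
    archive.writestr("pre.txt", "adsh\tline\n0001\t1\n")
buffer.seek(0)

frames = dict(iter_txt_frames(buffer))
# Same row-by-row output shape as the loop in the question.
rows = [dict(row) for df in frames.values() for _, row in df.iterrows()]
```

Inside custom_parse you could loop over iter_txt_frames(local_path), call self.standardized on each data frame, and yield the rows per file, so nothing depends on a variable that is only set for the first file.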

Upvotes: 1
