Reputation: 17
My current code opens only one .txt file in a zip archive that contains 4 .txt files. I want to read all of those .txt files into a CSV, but I'm not sure why it isn't reading every one of them. I assume the cause is the line zip_csv_file = zip_csv_files[0], which takes only the first match, but I don't know how to modify my code to parse all the .txt files in a given zip archive.
Code:
def custom_parse(self, response):
    self.logger.info(response.url)
    links = response.xpath("//a[contains(@href, '.zip')]/@href").getall()
    for link in list(set(links)):
        print(link)
        local_path = self.download_file("https://www.sec.gov" + link)
        zip_file = zipfile.ZipFile(local_path)
        zip_csv_files = [file_name for file_name in zip_file.namelist() if file_name.endswith(".txt") and "pre" not in file_name]
        zip_csv_file = zip_csv_files[0]
        with zip_file.open(zip_csv_file, "r") as zip:
            df = pd.read_csv(BytesIO(zip.read()), dtype=str, sep='\t')
        df = self.standardized(df)
        for k, row in df.iterrows():
            yield dict(row)
Edit:
with zip_file.open(zip_csv_file, "r") as zip:
UnboundLocalError: local variable 'zip_csv_file' referenced before assignment
Upvotes: 0
Views: 98
Reputation: 6545
If you want the data from all the files in a single data frame, you can try this:
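import zipfile
import pandas as pd

path = "<path to the zip file>"  # replace with the actual path
# Open the archive once; sep="\t" matches the tab-separated files in the question
with zipfile.ZipFile(path) as zip_file:
    df = pd.concat(
        [pd.read_csv(zip_file.open(text_file), sep="\t")
         for text_file in zip_file.infolist()
         if text_file.filename.endswith('.txt') and "pre" not in text_file.filename],
        ignore_index=True
    )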
If you want each file as a separate data frame:
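import zipfile
import pandas as pd

path = "<path to the zip file>"  # replace with the actual path
with zipfile.ZipFile(path) as zip_file:
    # One data frame per matching file, keyed by its name inside the archive
    dfs = {text_file.filename: pd.read_csv(zip_file.open(text_file), sep="\t")
           for text_file in zip_file.infolist()
           if text_file.filename.endswith('.txt') and "pre" not in text_file.filename}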
This gives you a dict of data frames, keyed by filename.
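If you would rather keep yielding one dict per row, as your spider does, the same idea can be written as a loop over every matching file instead of indexing [0]. This is only a rough sketch: iter_zip_rows is a name I made up, the in-memory archive and its contents are invented for the demonstration, and I left out your self.standardized step because I don't know what it does.

```python
import io
import zipfile

import pandas as pd


def iter_zip_rows(zip_source):
    """Yield one dict per row from every matching .txt member of a zip archive.

    zip_source may be a filesystem path or a binary file-like object.
    (Hypothetical helper, not part of Scrapy or pandas.)
    """
    with zipfile.ZipFile(zip_source) as zip_file:
        txt_names = [
            name for name in zip_file.namelist()
            if name.endswith(".txt") and "pre" not in name
        ]
        # Loop over all matches rather than taking only txt_names[0]
        for name in txt_names:
            with zip_file.open(name) as member:
                df = pd.read_csv(member, dtype=str, sep="\t")
            for _, row in df.iterrows():
                yield dict(row)


# Small in-memory archive with made-up file names and contents, for illustration only
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sub.txt", "adsh\tname\n1\tAlpha\n2\tBeta\n")
    archive.writestr("num.txt", "adsh\tvalue\n1\t10\n")
    archive.writestr("pre.txt", "adsh\tline\n1\t1\n")  # skipped by the "pre" filter
buffer.seek(0)

rows = list(iter_zip_rows(buffer))
```

Inside custom_parse you could then write "for row in iter_zip_rows(local_path): yield row" in place of the single-file block, applying self.standardized to each df inside the helper if you still need it. Looping over a possibly empty list also avoids indexing or referencing a file name that was never assigned when an archive has no matching .txt files.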
Upvotes: 1