Reputation: 941
I wrote a script that hits a URL, downloads a zip file, and unzips it. Now I am having trouble parsing the CSV file that I get after unzipping.
import csv
from requests import get
from io import BytesIO
from zipfile import ZipFile
request = get('https://example.com/some_file.zip')
zip_file = ZipFile(BytesIO(request.content))
files = zip_file.namelist()
with open(files[0], 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)
Upvotes: 5
Views: 4666
Reputation: 40884
See above the answer by @joe-heffer: https://stackoverflow.com/a/53187751/223424
When you do files = zip_file.namelist(), you just list the names of the files in the zip archive; these files are not yet extracted from the zip, so you cannot open them as local files, as you are doing.
You can read a stream of data directly from a zip file using ZipFile.open.
So this should work:
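from io import TextIOWrapper

zip_file = ZipFile(BytesIO(request.content))
files = zip_file.namelist()
with zip_file.open(files[0], 'r') as binary_file:
    # ZipFile.open returns bytes; csv.reader needs text in Python 3 (UTF-8 assumed here)
    csvfile = TextIOWrapper(binary_file, encoding='utf-8', newline='')
    csvreader = csv.reader(csvfile)
    ...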
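To try this without hitting a real URL, one option is to build a small archive in memory and read it back. The following is only a sketch: the helper names make_zip and read_first_csv, the member name data.csv, and the UTF-8 encoding are assumptions made for illustration and do not come from the question. Since ZipFile.open returns a binary stream in Python 3, the sketch wraps it in io.TextIOWrapper before handing it to csv.reader.

```python
import csv
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile


def make_zip(name, text):
    """Build a ZIP archive in memory holding one text file (stand-in for request.content)."""
    buffer = BytesIO()
    with ZipFile(buffer, 'w') as archive:
        archive.writestr(name, text)
    return buffer.getvalue()


def read_first_csv(zip_bytes, encoding='utf-8'):
    """Return the rows of the first archive member, parsed as CSV."""
    with ZipFile(BytesIO(zip_bytes)) as zip_file:
        first_name = zip_file.namelist()[0]
        # ZipFile.open gives bytes; TextIOWrapper decodes them for csv.reader
        with zip_file.open(first_name, 'r') as binary_file:
            text_file = TextIOWrapper(binary_file, encoding=encoding, newline='')
            # Consume all rows before the with blocks close the streams
            return list(csv.reader(text_file))


rows = read_first_csv(make_zip('data.csv', 'id,name\n1,alice\n2,bob\n'))
for row in rows:
    print(row)
```

In the original script, request.content would take the place of the bytes returned by make_zip.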
Upvotes: 4
Reputation: 17
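After some hours of searching and trying, I finally got something working. My need was to download a ZIP archive, open the first file in it whose name contains a given string, and print the URLs from the first line of that file that contains another given string. Here is my script: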
#!/usr/bin/env python
from io import BytesIO
from zipfile import ZipFile
import requests
import re
import sys
# define url value
url = "https://whateverurlyouneed"
# Define string to be found in the file name to be extracted
filestr = "anystring"
# Define string to be found in URL
urlstr = "anystring"
# Define regex to extract URL
regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
# download zip file
content = requests.get(url)
# Open stream
zipfile = ZipFile(BytesIO(content.content))
# Open first file from the ZIP archive containing
# the filestr string in the name
data = [zipfile.open(file_name) for file_name in zipfile.namelist() if filestr in file_name][0]
# read lines from the file. If csv found, print URL and exit
# This will return the 1st URL containing CSV in the opened file
for line in data.readlines():
    if urlstr in line.decode("latin-1"):
        urls = re.findall(regularex, line.decode("latin-1"))
        print([url[0] for url in urls])
        break
sys.exit(0)
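The same idea can be packaged as a function and tried against an archive built in memory, so no download is needed. This is a rough sketch under a few assumptions: the function name find_urls, the sample file names, and the sample contents are made up for illustration, and the URL pattern is deliberately much simpler than the regex in the script, so it will not handle every URL form.

```python
import re
from io import BytesIO
from zipfile import ZipFile

# Deliberately simple URL pattern, for illustration only
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_urls(zip_bytes, filestr, urlstr, encoding="latin-1"):
    """Return the URLs on the first line containing urlstr, taken from
    the first archive member whose name contains filestr."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        # next() stops at the first matching name; None if nothing matches
        name = next((n for n in archive.namelist() if filestr in n), None)
        if name is None:
            return []
        with archive.open(name) as member:
            for raw_line in member:
                line = raw_line.decode(encoding)
                if urlstr in line:
                    return URL_PATTERN.findall(line)
    return []


# Build a tiny archive in memory to exercise the function
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", "nothing here\n")
    archive.writestr("links_report.txt", "header\nfile: https://example.com/data.csv done\n")

print(find_urls(buffer.getvalue(), "report", "csv"))
```

Returning an empty list when no member or line matches avoids the IndexError that indexing an empty list comprehension with [0] would raise.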
Upvotes: 0
Reputation: 399
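import csv
import io
import logging
import zipfile
import requests

logger = logging.getLogger(__name__)

def iter_csv_rows(url):  # function name is illustrative; the original snippet omitted it
    response = requests.get(url)
    with io.BytesIO(response.content) as buffer:
        with zipfile.ZipFile(buffer) as zip_file:
            # Get first file in the archive
            zip_info = zip_file.infolist()[0]
            logger.debug(zip_info)
            # Open file
            with zip_file.open(zip_info) as file:
                # Load CSV file, decode binary to text
                with io.TextIOWrapper(file) as text:
                    # Yield rows while the streams are still open
                    yield from csv.DictReader(text)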
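One detail worth keeping in mind with this approach is that csv.DictReader reads lazily, so the rows have to be consumed before the with blocks close the underlying streams. The sketch below shows one way to do that with a generator, using an archive built in memory instead of a download. The function name iter_csv_dicts, the file name people.csv, the sample data, and the UTF-8 encoding are assumptions for illustration.

```python
import csv
import io
import zipfile


def iter_csv_dicts(zip_bytes, encoding="utf-8"):
    """Yield each row of the first archive member as a dict keyed by the CSV header."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        first_member = archive.infolist()[0]
        with archive.open(first_member) as binary_file:
            with io.TextIOWrapper(binary_file, encoding=encoding, newline="") as text:
                # Yield inside the with blocks so the streams are still open
                yield from csv.DictReader(text)


# In-memory archive standing in for response.content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("people.csv", "id,name\n1,alice\n2,bob\n")

records = list(iter_csv_dicts(buffer.getvalue()))
print(records)
```

Wrapping the call in list() forces every row to be read while the archive is still open; iterating the generator in a for loop works the same way.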
Upvotes: 1
Reputation: 453
Looks like you haven't imported the csv module. Try putting import csv at the top with your imports.
Upvotes: 0