Reputation: 2876
I'm new to Python and trying to build my first script. I want to scrape a list of URLs and export the results to a CSV file.
My script runs without errors, but when I open the CSV file only a few lines of data are written. When I print the lists I'm trying to write (sharelist and sharelist1), the output is complete, whereas the CSV file is not.
Here is part of my code:
for url in urllist[10:1000]:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        if e.getcode() == 404:  # check the return code
            continue
    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() is used to remove starting and trailing whitespace
    # save the data in tuple
    sharelist.append(url)
    sharelist1.append(share)

    # open a file for writing.
    csv_out = open('mycsv.csv', 'wb')
    # create the csv writer object.
    mywriter = csv.writer(csv_out)
    # writerow - one row of data at a time.
    for row in zip(sharelist, sharelist1):
        mywriter.writerow(row)
    # always make sure that you close the file.
    # otherwise you might find that it is empty.
    csv_out.close()
Not sure which part of my code I should share here. Please tell me if this isn't enough!
Upvotes: 0
Views: 116
Reputation: 13
Open the file for writing using a context manager; that way you don't need to close the file explicitly:
with open('mycsv.csv', 'w') as file_obj:
    mywriter = csv.writer(file_obj)
    for url in urllist[10:1000]:
        try:
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404:  # check the return code
                continue
        soup = BeautifulSoup(page, 'html.parser')
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()
        # no need to use zip or append to 2 lists, as they're expensive calls,
        # and by the looks of it, I think that would create duplicate rows in your file
        mywriter.writerow((url, share))
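One hedged caveat on the open mode: the urllib2 import in the question implies Python 2, where the csv docs recommend opening the output file with the 'b' flag ('wb'); the plain 'w' above is the Python 3 convention, where you should also pass newline='' so the csv module controls line endings itself. A minimal Python 3 sketch (the row values are placeholders, not from the question):

import csv

# Python 3: text mode plus newline='' keeps the csv module's row
# terminators from turning into blank rows on Windows.
with open('mycsv.csv', 'w', newline='') as file_obj:
    mywriter = csv.writer(file_obj)
    mywriter.writerow(('http://example.com', '42'))  # placeholder row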
Upvotes: 0
Reputation: 1320
The problem has been found, and the best solution for files is to use the with keyword, which closes the file automatically at the end:
with open('mycsv.csv', 'wb') as csv_out:
    mywriter = csv.writer(csv_out)
    for url in urllist[10:1000]:
        try:
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404:
                continue
        soup = BeautifulSoup(page, 'html.parser')
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()
        # save the data in tuple
        sharelist.append(url)
        sharelist1.append(share)
    for row in zip(sharelist, sharelist1):
        mywriter.writerow(row)
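A side benefit worth noting (my addition, not part of the answer): with closes the file even when the body raises, which a plain open()/close() pair does not guarantee. A minimal sketch with a simulated failure in place of real scraping:

# The file is closed on the way out of the with block, even though the
# body never reaches any close() call of its own.
try:
    with open('mycsv.csv', 'wb') as csv_out:
        raise RuntimeError('simulated crash mid-scrape')
except RuntimeError:
    pass

print(csv_out.closed)  # True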
Upvotes: 1
Reputation: 994
The problem is that you are opening the file every time you go through the loop. Each open in 'wb' mode truncates the file, so the previous contents are overwritten.
# open a file for writing.
csv_out = open('mycsv.csv', 'wb')
# create the csv writer object.
mywriter = csv.writer(csv_out)
# writerow - one row of data at a time.
for row in zip(sharelist, sharelist1):
    mywriter.writerow(row)
# always make sure that you close the file.
# otherwise you might find that it is empty.
csv_out.close()
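To see the overwrite concretely, a minimal standalone sketch (demo.txt is a made-up file name):

# Every open() in 'w' (or 'wb') mode truncates the file, so each pass
# through the loop discards what the previous pass wrote.
for i in range(3):
    out = open('demo.txt', 'w')
    out.write('iteration %d\n' % i)
    out.close()

print(open('demo.txt').read())  # only 'iteration 2' survives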
Either open the file before the loop, or open it with the append option.
This is option one (note the indentation):
# open a file for writing.
csv_out = open('mycsv.csv', 'wb')
# create the csv writer object.
mywriter = csv.writer(csv_out)
for url in urllist[10:1000]:
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        if e.getcode() == 404:  # check the return code
            continue
    soup = BeautifulSoup(page, 'html.parser')
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()
    # save the data in tuple
    sharelist.append(url)
    sharelist1.append(share)

# writerow - one row of data at a time.
for row in zip(sharelist, sharelist1):
    mywriter.writerow(row)
# always make sure that you close the file.
# otherwise you might find that it is empty.
csv_out.close()
This is option 2:
for url in urllist[10:1000]:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        if e.getcode() == 404:  # check the return code
            continue
    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() is used to remove starting and trailing whitespace
    # save the data in tuple
    sharelist.append(url)
    sharelist1.append(share)

    # open the file for appending.
    csv_out = open('mycsv.csv', 'ab')
    # create the csv writer object.
    mywriter = csv.writer(csv_out)
    # writerow - one row of data at a time.
    for row in zip(sharelist, sharelist1):
        mywriter.writerow(row)
    # always make sure that you close the file.
    # otherwise you might find that it is empty.
    csv_out.close()
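One hedged caution about option 2 as written: because the zip loop re-writes everything accumulated in sharelist and sharelist1 on every pass, append mode will duplicate earlier rows (the first answer makes the same observation). A sketch that appends only the row scraped on the current iteration; the helper name append_row is mine, not part of the answer:

import csv

def append_row(path, url, share):
    # 'ab' preserves earlier rows (Python 2 binary mode, as in the answer);
    # writing one row per call avoids re-emitting the whole list each time.
    csv_out = open(path, 'ab')
    csv.writer(csv_out).writerow((url, share))
    csv_out.close()

Inside the url loop you would then call append_row('mycsv.csv', url, share) in place of the zip loop.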
Upvotes: 3