Reputation: 394
I am attempting to scrape some simple dictionary information from an HTML page. So far I am able to print all the words I need in the IDE. My next step was to transfer the words to an array, and my last step was to save that array as a CSV file. When I run my code, it seems to stop taking information after the 1309th or 1311th word, although I believe there are over 1 million on the web page. I am stuck and would be very appreciative of any help. Thank you
from bs4 import BeautifulSoup
from urllib import urlopen  # Python 2; on Python 3 this is urllib.request.urlopen
import csv

html = urlopen('http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_a.html').read()
soup = BeautifulSoup(html, "lxml")

# Collect the contents of every <b> tag (the headwords on this page)
words = []
for section in soup.findAll('b'):
    words.append(section.renderContents())

print ('success')
print (len(words))

# Write all words as a single CSV row, then close so the buffer is flushed
myfile = open('A.csv', 'wb')
wr = csv.writer(myfile)
wr.writerow(words)
myfile.close()
Upvotes: 4
Views: 341
Reputation: 8147
I suspect a good deal of your problem may lie in how you're processing the scraped content. Do you need to scrape all the content before you output it to the file? Or can you do it as you go?
Instead of appending over and over to a list, you should use yield.
def tokenize(soup_):
    for section in soup_.findAll('b'):
        yield section.renderContents()
This gives you a generator; as long as section.renderContents() returns a string, the csv module can write it out with no problem.
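For completeness, here is a minimal sketch of how the generator and the csv module fit together. This assumes Python 2 to match the question's urllib import, and one word per row is just my choice of layout (the original code put all the words on a single row):

with open('A.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    # writerows consumes the generator lazily, one row at a time,
    # so the full word list is never held in memory
    wr.writerows([word] for word in tokenize(soup))

On Python 3 you would open the file with open('A.csv', 'w', newline='') instead.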
Upvotes: 0
Reputation: 473803
I was not able to reproduce the problem (I always get 11616 items), but I suspect you have outdated beautifulsoup4 or lxml versions installed. Upgrade both:
pip install --upgrade beautifulsoup4
pip install --upgrade lxml
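If you want to check which versions you currently have before upgrading (these are standard pip/Python commands, nothing specific to this problem):

pip show beautifulsoup4 lxml
python -c "import bs4; print(bs4.__version__)"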
Of course, this is just a theory.
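One quick way to test it: swap lxml for the HTML parser bundled with Python and see whether the word count changes. If it does, the lxml parser was the culprit.

soup = BeautifulSoup(html, "html.parser")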
Upvotes: 1