Reputation: 15619
I'm trying to implement a web scraper using requests and BeautifulSoup. The crawling code works correctly, but the extraction step doesn't: the only data written to the output file is the header row. I've looked at dozens of examples online but still haven't been able to fix the problem. Where am I going wrong?
secondSoupParser = BeautifulSoup(raw_html, 'html.parser')
list_of_headers = []
list_of_paras = []
try:
    results_parser = secondSoupParser.find('div', attrs={'style':'padding-left:10px;width:98%'})
except AttributeError as e:
    logging.exception(e)
    sys.exit(1)
for div in results_parser.findAll('h2'):
    for para in div.findAll('p'):
        para_text = para.text.strip()
        list_of_paras.append(para_text)
    list_of_headers.append(list_of_paras)
filenameTest = (output_directory + '/'+ 'test' + '-' + timestamp + '.csv')
output_file2 = open(filenameTest, 'w', encoding='utf8')
writer2 = csv.writer(output_file2)
writer2.writerow(['Test'])
writer2.writerow(list_of_headers)
The HTML of the target page has this structure:
<div style="padding-left:10px;width:98%">
<p><i>Last revised: A date is here</i></p>
<h2>Header One</h2>
<p>Some text goes here.</p>
<h2>Header Two</h2>
<p>Some text goes here.</p>
<h2>Header Three</h2>
<p>Some text goes here.</p>
<h2>Header Four</h2>
<p>Some text goes here.</p>
<h2>Header Five</h2>
<p>Some text goes here.</p>
<h2>Header Six</h2>
<p>Some text goes here.</p>
</div>
Upvotes: 0
Views: 333
Reputation: 3063
The <p> tags are not nested inside the <h2> tags, so there is no need to loop over the <h2> tags first. This should be enough to extract the text of each <p> into the list:
results_parser = secondSoupParser.find('div', attrs={'style': 'padding-left:10px;width:98%'})
for para in results_parser.findAll('p'):
    para_text = para.text.strip()
    list_of_paras.append(para_text)
list_of_headers.append(list_of_paras)
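If the goal is to keep each header together with its paragraph, one possible approach is to walk the <h2> tags and take the next sibling <p> of each. The sketch below is only an illustration under a few assumptions: the sample markup is a shortened copy of the HTML in the question, the names container, rows and csv_text are made up for this example, and the CSV is written to an in-memory buffer instead of a file so that the snippet runs on its own. It also checks for None, because find() returns None when nothing matches rather than raising AttributeError, and it uses writerows() so that each header/paragraph pair becomes its own row.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup: a shortened copy of the HTML shown in the question.
raw_html = """
<div style="padding-left:10px;width:98%">
<p><i>Last revised: A date is here</i></p>
<h2>Header One</h2>
<p>Some text goes here.</p>
<h2>Header Two</h2>
<p>Some text goes here.</p>
</div>
"""

soup = BeautifulSoup(raw_html, 'html.parser')
container = soup.find('div', attrs={'style': 'padding-left:10px;width:98%'})

# find() returns None when nothing matches; it does not raise AttributeError.
if container is None:
    raise SystemExit('container div not found')

# The <h2> tags have no <p> descendants, so a nested search finds nothing.
nested = [p for h2 in container.find_all('h2') for p in h2.find_all('p')]

# Pair each header with the next <p> that follows it at the same level.
rows = []
for h2 in container.find_all('h2'):
    para = h2.find_next_sibling('p')
    text = para.get_text(strip=True) if para else ''
    rows.append([h2.get_text(strip=True), text])

# Write one CSV row per header/paragraph pair (in memory for this example).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Header', 'Text'])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Note that find_next_sibling('p') returns the first following <p> even if another <h2> comes in between, so this pairing only holds when every header is followed by its own paragraph, as in the sample. When writing to a real file, opening it with newline='' (as the csv module documentation recommends) avoids blank lines between rows on Windows.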
Upvotes: 1