Life is complex

Reputation: 15619

BeautifulSoup not parsing headers or paragraphs

I'm trying to implement web harvesting using requests and BeautifulSoup. The crawler code works correctly, but the extraction step does not: the only data written to the output file is the header row. I've looked at dozens of examples online but still haven't been able to fix the problem. Where am I going wrong?

secondSoupParser = BeautifulSoup(raw_html, 'html.parser')
list_of_headers = []
list_of_paras = []

try:
    results_parser = secondSoupParser.find('div', attrs={'style': 'padding-left:10px;width:98%'})
except AttributeError as e:
    logging.exception(e)
    sys.exit(1)

for div in results_parser.findAll('h2'):
    for para in div.findAll('p'):
        para_text = para.text.strip()
        list_of_paras.append(para_text)
        list_of_headers.append(list_of_paras)

filenameTest = (output_directory + '/' + 'test' + '-' + timestamp + '.csv')
output_file2 = open(filenameTest, 'w', encoding='utf8')

writer2 = csv.writer(output_file2)
writer2.writerow(['Test'])
writer2.writerow(list_of_headers)

UPDATE

The HTML of the target page has the following format:

<div style="padding-left:10px;width:98%">
    <p><i>Last revised: A date is here</i></p>
    <h2>Header One</h2>
    <p>Some text goes here.</p>
    <h2>Header Two</h2>
    <p>Some text goes here.</p>
    <h2>Header Three</h2>
    <p>Some text goes here.</p>
    <h2>Header Four</h2>
    <p>Some text goes here.</p>
    <h2>Header Five</h2>
    <p>Some text goes here.</p>
    <h2>Header Six</h2>
    <p>Some text goes here.</p>
</div>

Upvotes: 0

Views: 333

Answers (1)

gerosalesc

Reputation: 3063

The <p> tags are not nested inside the <h2> tags; they are siblings, so looping over the <h2> tags first finds no paragraphs. Looping over the <p> tags directly should be enough to extract their text into the lists:

results_parser = secondSoupParser.find('div', attrs={'style': 'padding-left:10px;width:98%'})

for para in results_parser.findAll('p'):
    para_text = para.text.strip()
    list_of_paras.append(para_text)
    list_of_headers.append(list_of_paras)
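
Note that the snippet above appends the same list_of_paras object to list_of_headers on every iteration, and it also collects the "Last revised" paragraph. If the goal is one CSV row per header, a possible approach is to walk the <h2> tags and take the <p> sibling that follows each one. The following is only a minimal sketch under some assumptions: the sample markup is a shortened copy of the HTML from the question, the function name extract_sections and the column names are hypothetical, and an in-memory io.StringIO buffer stands in for the real output file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the markup shown in the question's UPDATE section.
raw_html = """
<div style="padding-left:10px;width:98%">
    <p><i>Last revised: A date is here</i></p>
    <h2>Header One</h2>
    <p>Some text goes here.</p>
    <h2>Header Two</h2>
    <p>Some text goes here.</p>
</div>
"""


def extract_sections(html):
    """Return a list of (header_text, paragraph_text) tuples (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"style": "padding-left:10px;width:98%"})
    # find() returns None when nothing matches; it does not raise AttributeError.
    if container is None:
        return []
    rows = []
    for header in container.find_all("h2"):
        # The <p> is a sibling of the <h2>, not a child of it.
        para = header.find_next_sibling("p")
        para_text = para.get_text(strip=True) if para is not None else ""
        rows.append((header.get_text(strip=True), para_text))
    return rows


rows = extract_sections(raw_html)

# Write one CSV row per header; swap the buffer for a real file as needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Header", "Paragraph"])
writer.writerows(rows)
print(buffer.getvalue())
```

One caveat with this sketch: if a header has no paragraph of its own, find_next_sibling('p') will pick up the paragraph that belongs to the next header, so pages with irregular structure would need an extra check.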

Upvotes: 1
