BeautifulSoup not parsing headers or paragraphs

Question

I'm trying to implement web harvesting using requests and BeautifulSoup. The web crawler code is working correctly, but the extraction piece isn't. The only data being written to the output file is the header row. I've looked at dozens of examples online but still haven't been able to correct my problem. Where am I going wrong?

secondSoupParser = BeautifulSoup(raw_html, 'html.parser')
list_of_headers = [] 
list_of_paras = [] 

try:
    results_parser = secondSoupParser.find('div', attrs={'style':'padding-left:10px;width:98%'})
except AttributeError as e:
    logging.exception(e)
    sys.exit(1)

for div in results_parser.findAll('h2'):
    for para in div.findAll('p'):
        para_text = para.text.strip()
        list_of_paras.append(para_text)
        list_of_headers.append(list_of_paras)

filenameTest = (output_directory + '/'+ 'test' + '-' + timestamp + '.csv')    
output_file2 = open(filenameTest, 'w', encoding='utf8')  

writer2 = csv.writer(output_file2) 
writer2.writerow(['Test']) 
writer2.writerow(list_of_headers)

UPDATE

The content at the target URL has this format:


    Last revised: A date is here
    Header One
    Some text goes here.
    Header Two
    Some text goes here.
    Header Three
    Some text goes here.
    Header Four
    Some text goes here.
    Header Five
    Some text goes here.
    Header Six
    Some text goes here.

gerosalesc · Accepted Answer

The <p> tags are not contained in the <h2> tags, so there is no need to loop over the <h2> tags first. This should be good enough to extract the text of each <p> into the lists:

results_parser = secondSoupParser.find('div', attrs={'style': 'padding-left:10px;width:98%'})

for para in results_parser.findAll('p'):
    para_text = para.text.strip()
    list_of_paras.append(para_text)
    list_of_headers.append(list_of_paras)
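
If the goal is to keep each header next to its paragraph in the CSV, something along these lines might work. This is only a sketch under assumptions: the sample HTML is made up to resemble the layout described in the question, each <p> is assumed to be a later sibling of its <h2>, and the names extract_sections and write_sections are hypothetical helpers, not part of BeautifulSoup. It also checks for None, because find() returns None when nothing matches instead of raising an exception.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample page, modelled on the layout described in the question.
SAMPLE_HTML = """
<div style="padding-left:10px;width:98%">
  <p>Last revised: A date is here</p>
  <h2>Header One</h2>
  <p>Some text goes here.</p>
  <h2>Header Two</h2>
  <p>Some more text goes here.</p>
</div>
"""


def extract_sections(html):
    """Return (header, paragraph) pairs for each <h2> in the results div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"style": "padding-left:10px;width:98%"})
    if container is None:
        # find() returns None rather than raising when nothing matches.
        return []

    sections = []
    for header in container.find_all("h2"):
        # Assumption: the paragraph is the next <p> sibling after its header.
        # A header with no paragraph of its own would pick up a later one.
        para = header.find_next_sibling("p")
        text = para.get_text(strip=True) if para is not None else ""
        sections.append((header.get_text(strip=True), text))
    return sections


def write_sections(sections, output_file):
    """Write one CSV row per (header, paragraph) pair."""
    writer = csv.writer(output_file)
    writer.writerow(["Header", "Text"])
    writer.writerows(sections)


sections = extract_sections(SAMPLE_HTML)

# An in-memory buffer stands in for the real output file here.
buffer = io.StringIO()
write_sections(sections, buffer)
print(buffer.getvalue())
```

With a real file, open it with newline='' as the csv module documentation recommends, and pass the file object to write_sections in place of the buffer.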
