Reputation: 35
I'm trying to scrape a website into a structured data format. I'd like to end up with a .csv with 6 columns: country, date, general_text, fiscal_text, monetary_text, fx_text.
The mapping is:
country <- h3
date <- h6
general_text <- the p tag that follows the h3 header
fiscal_text <- the p tag that follows the **first** h5 (this p is nested inside ul and li blocks)
monetary_text <- the p tag that follows the **second** h5 (also nested inside ul and li blocks)
fx_text <- the p tag that follows the **third** h5 (also nested inside ul and li blocks)
The pattern ends at the next h3 (country) heading.
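To show the mapping on a minimal snippet (the tag layout matches the real page, the text is made up):

```python
from bs4 import BeautifulSoup

# Made-up snippet mirroring the per-country structure described above
html = """
<div class="rr-intro">
  <h3>Country 1</h3>
  <p>summary text</p>
  <h6>date</h6>
  <h5>Fiscal</h5>
  <ul><li><p>fiscal text</p></li></ul>
  <h5>Monetary and macro-financial</h5>
  <ul><li><p>monetary text</p></li></ul>
  <h5>Exchange rate and balance of payments</h5>
  <ul><li><p>fx text</p></li></ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

h3 = soup.find('h3')
print(h3.get_text(strip=True))                     # country
print(h3.find_next('p').get_text(strip=True))      # general_text
print(h3.find_next('h6').get_text(strip=True))     # date
for h5 in soup.find_all('h5'):
    # find_next('p') skips over the ul/li wrappers automatically
    print(h5.find_next('p').get_text(strip=True))  # fiscal/monetary/fx
```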
I'm finding it difficult to get each element in its proper place/column.
The site structure repeats this for each country (see below for the actual tags):
h3
p
h6
h5
ul
li
p
h5
ul
li
p
h5
ul
li
p
I have the following code for simple text extraction:
import requests
import io
import csv
from bs4 import BeautifulSoup

URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='rr-intro')

with io.open('test.txt', 'w', encoding='utf8') as f:
    for header in results.find_all(['h3', 'h6', 'h5']):
        f.write(header.get_text() + u'\n')
        for elem in header.next_siblings:
            if elem.name and elem.name.startswith('h'):
                # stop at the next header
                break
            if elem.name and elem.find_all('p'):
                f.write(elem.get_text() + u'\n')
From the comments, I thought it made sense instead to create lists and somehow zip them. I tried this:
h3 = results.find_all('h3')
h6 = results.find_all('h6')
h5 = results.find_all('h5')
h5f = results.find_all('h5', text='Fiscal')
h5m = results.find_all('h5', text='Monetary and macro-financial')
h5x = results.find_all('h5', text='Exchange rate and balance of payments')
country = [country.get_text() for country in h3] #list of countries
date = [date.get_text() for date in h6] #date string
I'm stuck here. I'm not sure how to get the contents of the p tags into the right place in each list so they can be zipped, or written directly to a csv.
I'm a Python rookie, so I pieced this together from what I found on Stack Overflow. Any help would be greatly appreciated.
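For reference, here is the kind of zip-to-csv step I had in mind, as a self-contained toy (made-up data, only three of the six columns):

```python
import csv
import io

# Toy stand-ins for the lists built from find_all above
country = ['Country 1', 'Country 2']
date = ['May 1, 2020', 'May 2, 2020']
fiscal = ['fiscal 1', 'fiscal 2']

# zip pairs up the i-th element of each list into one row
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['country', 'date', 'fiscal_text'])
writer.writerows(zip(country, date, fiscal))
print(buf.getvalue())
```

Note that zip only lines rows up correctly if every list has the same length and order, which is exactly the part I can't guarantee.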
Edit: To clarify, the structure of what I want looks like this.
<div class="rr-intro">
<h3>
Country 1
</h3>
<p>
summary text
</p>
<h6>
date
</h6>
<h5>
Fiscal
</h5>
<ul>
<li>
<p>
text for fiscal of country 1
</p>
</li>
</ul>
<h5>
Monetary and macro-financial
</h5>
<ul>
<li>
<p>
text for monetary of country 1
</p>
</li>
</ul>
<h5>
Exchange rate and balance of payments
</h5>
<ul>
<li>
<p>
text for FX of country 1
</p>
</li>
</ul>
<h3>
Country 2
</h3>
<p>
summary text
</p>
<h6>
date
</h6>
<h5>
Fiscal
</h5>
<ul>
<li>
<p>
text for fiscal of country 2
</p>
</li>
</ul>
<h5>
Monetary and macro-financial
</h5>
<ul>
<li>
<p>
text for monetary of country 2
</p>
</li>
</ul>
<h5>
Exchange rate and balance of payments
</h5>
<ul>
<li>
<p>
text for FX of country 2
</p>
</li>
</ul>
<h3>
Country 3
</h3>
etc...
Upvotes: 2
Views: 108
Reputation: 46759
I feel the easiest approach is to deal with the elements in the order you read them: keep track of the current section, and append text into that section.
A Python csv.DictWriter can be used to write out a row of information once the next h3 country heading is found. For example:
from collections import defaultdict
import requests
import csv
from bs4 import BeautifulSoup
URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', class_='rr-intro')
section_lookup = {
    'Fiscal' : 'fiscal_text',
    'Moneta' : 'monetary_text',
    'Macro-' : 'monetary_text',
    'Exchan' : 'fx_text',
}

with open('data.csv', 'w', encoding='utf8', newline='') as f_output:
    fieldnames = ['country', 'date', 'general_text', 'fiscal_text', 'monetary_text', 'fx_text']
    csv_output = csv.DictWriter(f_output, fieldnames=fieldnames)
    csv_output.writeheader()

    row = defaultdict(str)
    section = None

    for elem in results.find_all(['h3', 'h6', 'h5', 'p']):
        if elem.name == 'h3':
            if row:
                csv_output.writerow(row)
            row = defaultdict(str)
            row['country'] = elem.get_text(strip=True)
            section = "general_text"
        elif elem.name == 'h5':
            section = section_lookup[elem.get_text(strip=True)[:6]]
        elif elem.name == 'h6':
            row['date'] = elem.get_text(strip=True)[27:]
        elif elem.name == 'p' and section:
            row[section] = f"{row[section]} {elem.get_text(strip=True)}"

    if row:
        csv_output.writerow(row)
Giving you a data.csv file starting:
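One detail worth noting: because row is a defaultdict(str), appending to a section that hasn't been seen yet starts from an empty string, so multiple p tags under one h5 accumulate without any key-existence checks. A minimal illustration:

```python
from collections import defaultdict

row = defaultdict(str)
section = 'fiscal_text'
for text in ['First paragraph.', 'Second paragraph.']:
    # missing keys default to '', so this works on the first append too
    row[section] = f"{row[section]} {text}"

print(row['fiscal_text'])  # accumulated text (with a leading space)
print(row['fx_text'])      # unseen sections stay empty
```

The accumulated value carries a leading space from the join; you could .strip() each value before writing the row if that matters for your output.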
Upvotes: 1