Reputation: 35
I'm trying to scrape a website into a structured data format. I'd like to end up with a .csv with 6 columns: country, date, general_text, fiscal_text, monetary_text, fx_text.
The mapping is:
country <- h3
date <- h6
general_text <- the p tag that follows the h3 header
fiscal_text <- the p tag that follows the **first** h5 (this p is nested inside ul and li blocks)
monetary_text <- the p tag that follows the **second** h5 (also nested inside ul and li blocks)
fx_text <- the p tag that follows the **third** h5 (also nested inside ul and li blocks)
The pattern ends at the next h3 (country) heading.
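To show the mapping on a minimal snippet (the tag layout matches the real page, the text is made up):

```python
from bs4 import BeautifulSoup

# Made-up snippet mirroring the per-country structure described above
html = """
<div class="rr-intro">
  <h3>Country 1</h3>
  <p>summary text</p>
  <h6>date</h6>
  <h5>Fiscal</h5>
  <ul><li><p>fiscal text</p></li></ul>
  <h5>Monetary and macro-financial</h5>
  <ul><li><p>monetary text</p></li></ul>
  <h5>Exchange rate and balance of payments</h5>
  <ul><li><p>fx text</p></li></ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

h3 = soup.find('h3')
print(h3.get_text(strip=True))                     # country
print(h3.find_next('p').get_text(strip=True))      # general_text
print(h3.find_next('h6').get_text(strip=True))     # date
for h5 in soup.find_all('h5'):
    # find_next('p') skips over the ul/li wrappers automatically
    print(h5.find_next('p').get_text(strip=True))  # fiscal/monetary/fx
```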
I'm finding it difficult to get each element in its proper place/column.
The site structure repeats this for each country (see below for the actual tags):
h3
p
h6
h5
ul
li
p
h5
ul
li
p
h5
ul
li
p
I have the following code for simple text extraction:
import requests
import io
import csv
from bs4 import BeautifulSoup

URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='rr-intro')

with io.open('test.txt', 'w', encoding='utf8') as f:
    for header in results.find_all(['h3', 'h6', 'h5']):
        f.write(header.get_text() + u'\n')
        for elem in header.next_siblings:
            if elem.name and elem.name.startswith('h'):
                # stop at the next header
                break
            if elem.name and elem.find_all('p'):
                f.write(elem.get_text() + u'\n')
From the comments, I thought it made sense instead to create lists and somehow zip them. I tried this:
h3 = results.find_all('h3')
h6 = results.find_all('h6')
h5 = results.find_all('h5')
h5f = results.find_all('h5', text='Fiscal')
h5m = results.find_all('h5', text='Monetary and macro-financial')
h5x = results.find_all('h5', text='Exchange rate and balance of payments')
country = [country.get_text() for country in h3] #list of countries
date = [date.get_text() for date in h6] #date string
I'm stuck here. I'm not sure how to get the contents of the p tags into the right place in each list so they can be zipped, or written directly to a csv.
I'm a Python rookie, so I pieced this together from what I found on Stack Overflow. Any help would be greatly appreciated.
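For reference, here is the kind of zip-to-csv step I had in mind, as a self-contained toy (made-up data, only three of the six columns):

```python
import csv
import io

# Toy stand-ins for the lists built from find_all above
country = ['Country 1', 'Country 2']
date = ['May 1, 2020', 'May 2, 2020']
fiscal = ['fiscal 1', 'fiscal 2']

# zip pairs up the i-th element of each list into one row
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['country', 'date', 'fiscal_text'])
writer.writerows(zip(country, date, fiscal))
print(buf.getvalue())
```

Note that zip only lines rows up correctly if every list has the same length and order, which is exactly the part I can't guarantee.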
Edit: To clarify, the structure of what I want looks like this.
<div class="rr-intro">
<h3>
Country 1
</h3>
<p>
summary text
</p>
<h6>
date
</h6>
<h5>
Fiscal
</h5>
<ul>
<li>
<p>
text for fiscal of country 1
</p>
</li>
</ul>
<h5>
Monetary and macro-financial
</h5>
<ul>
<li>
<p>
text for monetary of country 1
</p>
</li>
</ul>
<h5>
Exchange rate and balance of payments
</h5>
<ul>
<li>
<p>
text for FX of country 1
</p>
</li>
</ul>
<h3>
Country 2
</h3>
<p>
summary text
</p>
<h6>
date
</h6>
<h5>
Fiscal
</h5>
<ul>
<li>
<p>
text for fiscal of country 2
</p>
</li>
</ul>
<h5>
Monetary and macro-financial
</h5>
<ul>
<li>
<p>
text for monetary of country 2
</p>
</li>
</ul>
<h5>
Exchange rate and balance of payments
</h5>
<ul>
<li>
<p>
text for FX of country 2
</p>
</li>
</ul>
<h3>
Country 3
</h3>
etc...
Upvotes: 2
Views: 108
Reputation: 46759
I feel the easiest approach is to deal with the elements in the order you read them: keep track of the current section, and append text into that section.
A Python csv.DictWriter can be used to write out a row of information once the next h3 country heading is found. For example:
from collections import defaultdict
import requests
import csv
from bs4 import BeautifulSoup
URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', class_='rr-intro')
section_lookup = {
    'Fiscal' : 'fiscal_text',
    'Moneta' : 'monetary_text',
    'Macro-' : 'monetary_text',
    'Exchan' : 'fx_text',
}

with open('data.csv', 'w', encoding='utf8', newline='') as f_output:
    fieldnames = ['country', 'date', 'general_text', 'fiscal_text', 'monetary_text', 'fx_text']
    csv_output = csv.DictWriter(f_output, fieldnames=fieldnames)
    csv_output.writeheader()

    row = defaultdict(str)
    section = None

    for elem in results.find_all(['h3', 'h6', 'h5', 'p']):
        if elem.name == 'h3':
            if row:
                csv_output.writerow(row)
            row = defaultdict(str)
            row['country'] = elem.get_text(strip=True)
            section = "general_text"
        elif elem.name == 'h5':
            section = section_lookup[elem.get_text(strip=True)[:6]]
        elif elem.name == 'h6':
            row['date'] = elem.get_text(strip=True)[27:]
        elif elem.name == 'p' and section:
            row[section] = f"{row[section]} {elem.get_text(strip=True)}"

    if row:
        csv_output.writerow(row)
Giving you a data.csv file starting:
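One detail worth noting: because row is a defaultdict(str), appending to a section that hasn't been seen yet starts from an empty string, so multiple p tags under one h5 accumulate without any key-existence checks. A minimal illustration:

```python
from collections import defaultdict

row = defaultdict(str)
section = 'fiscal_text'
for text in ['First paragraph.', 'Second paragraph.']:
    # missing keys default to '', so this works on the first append too
    row[section] = f"{row[section]} {text}"

print(row['fiscal_text'])  # accumulated text (with a leading space)
print(row['fx_text'])      # unseen sections stay empty
```

The accumulated value carries a leading space from the join; you could .strip() each value before writing the row if that matters for your output.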
Upvotes: 1