Reputation: 33
I'm scraping information on central bank research publications. So far, for the Federal Reserve, I have the following Python code:
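import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
# Each publication is one <li class="list-group-item downfree"> element
for paper in soup.findAll("li", class_="list-group-item downfree"):
    print(paper.text)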
This produces the following for the first, of many, publications:
2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulasby Gary S. Anderson
I now want to convert this into a Python dictionary, which will eventually contain a large number of papers:
Papers = {
    'Date': '2018-070',
    'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
    'Author/s': 'Gary S. Anderson'
}
Upvotes: 2
Views: 3944
Reputation: 3107
You could use regular expressions to match each part of the string.
[-\d]+
matches the date: a run consisting only of digits and -
(?<=\s).*?(?=by)
matches the title: it starts after a whitespace character and ends just before "by" (where the author part begins)
(?<=by\s).*
matches the author(s): the rest of the string after "by"
Full code
import re

import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
# verify=False skips TLS certificate verification; drop it if the request works without it
page = requests.get(START_URL, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')
datas = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    data = dict()
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]  # first run of digits and hyphens
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0]  # lazy match up to the first "by"
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]  # everything after "by" plus one whitespace
    print(data)
    datas.append(data)
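To try the idea without hitting the site, here is a minimal offline sketch that parses the sample line from the question with a single pattern using named groups. The helper name parse_paper and the pattern PAPER_RE are my own, not part of any library, and the pattern assumes every entry looks like "<year>-<number> <title>by <authors>". The title group is greedy, so it splits at the last "by" that is followed by whitespace; that tolerates "by" inside a title but would misfire on an author name such as "Kirby Smith".

```python
import re

# Hypothetical pattern: assumes "<year>-<number> <title>by <authors>"
PAPER_RE = re.compile(
    r"^(?P<date>\d{4}-\d+)\s+(?P<title>.+)by\s+(?P<authors>.+)$"
)


def parse_paper(text):
    """Return a dict for one listing line, or None if it does not match."""
    # Collapse newlines and repeated spaces so the pattern sees a single line
    normalized = " ".join(text.split())
    match = PAPER_RE.match(normalized)
    if match is None:
        return None
    return {
        "Date": match["date"],
        "Title": match["title"].strip(),
        "Author/s": match["authors"],
    }


sample = (
    "2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model "
    "Solutions: An Algorithm with Error Formulasby Gary S. Anderson"
)
print(parse_paper(sample))
```

Returning None for lines that do not match lets the scraping loop skip odd entries instead of raising an IndexError, which is what the [0] indexing on re.findall would do.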
Upvotes: 0
Reputation: 682
I get good results by extracting all the descendants and picking only those that are NavigableStrings. Make sure to import NavigableString from bs4. I also use a list comprehension, but you could use for-loops as well.
import requests
from bs4 import BeautifulSoup, NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
papers = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    # Keep only the plain text nodes among the element's descendants
    info = [desc.strip() for desc in paper.descendants if type(desc) == NavigableString]
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})
print(papers[1])
{'Date': '2018-069',
'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}
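The indices 0, 1 and 3 make more sense once you see the text nodes listed in order. Below is a rough offline sketch of the same text-node idea using the standard library's html.parser instead of bs4 (a swapped-in technique, chosen so it runs without a network call or third-party packages). The HTML fragment is my guess at what one entry might look like, not the site's actual markup, and TextNodeCollector is a made-up class name.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects every text node, stripped, in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data.strip())


# Hypothetical markup for one entry; the real page may be structured differently
fragment = (
    '<li class="list-group-item downfree">2018-069 '
    '<b><a href="#">The Effect of Common Ownership on Profits</a></b>'
    '<br><i>by</i> Jacob P. Gramlich &amp; Serafin J. Grundl</li>'
)

collector = TextNodeCollector()
collector.feed(fragment)
collector.close()

# With this fragment, index 2 is the "by" separator, so the author sits at index 3
paper = {
    'Date': collector.nodes[0],
    'Title': collector.nodes[1],
    'Author': collector.nodes[3],
}
print(collector.nodes)
print(paper)
```

If the real markup differs, print the collected nodes for one entry first and adjust the indices accordingly.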
Upvotes: 1