Reputation: 488
I have a collection of HTML files which share the following structure:
<h1>ITEM NAME</h1>
<span class="standardLabel">Place of publication: </span>PLACENAME
<br /><span class="standardLabel">Publication dates: </span>DATE
<br /><span class="standardLabel">Notes: </span>NOTES
<br /><span class="standardLabel">Frequency: </span>FREQUENCY
What I want to extract is all the information shown in capitals above (PLACENAME, DATE, NOTES, FREQUENCY), but so far I have only been able to write a script that captures the item name and the place name:
# import packages
from bs4 import BeautifulSoup
import os
from os.path import dirname, join

directory = ("C:\\Users\\mobarget\\Google Drive\\ACADEMIA\\10_Data analysis_PhD\\NLI Newspaper DB")

# search information in each file
for infile in os.listdir(directory):
    filename = join(directory, infile)
    indata = open(filename, "r", encoding="utf-8", errors="ignore")
    contents = indata.read()
    soup = BeautifulSoup(contents, 'html')
    newspaper = soup.find('h1')
    if newspaper:
        print("Title of file no.", str(infile), ": ", newspaper)
        place = soup.find("span", {"class": "standardLabel"}).next_sibling
        print(place)
    else:
        continue
The output is:
Title of file no. 1 : <h1>About Town</h1>
Dungannon, Co. Tyrone
Title of file no. 10 : <h1>Amárach: Guth na Gaeltachta</h1>
Dublin, Co. Dublin
Title of file no. 100 : <h1>Belfast Election</h1>
Belfast, Co. Antrim
[etc.]
Any ideas how I could extract the missing data without making the code too redundant?
Upvotes: 1
Views: 312
Reputation: 488
Using the code from Andrej Kesely's answer, I have also added exception handling for missing attributes:
# import packages
from bs4 import BeautifulSoup
import os
from os.path import dirname, join

directory = ("C:\\Users\\mobarget\\Google Drive\\ACADEMIA\\10_Data analysis_PhD\\NLI Newspaper DB")

# read downloaded HTML files
for infile in os.listdir(directory):
    filename = join(directory, infile)
    indata = open(filename, "r", encoding="utf-8", errors="ignore")
    contents = indata.read()
    soup = BeautifulSoup(contents, 'html.parser')
    newspaper = soup.find('h1')
    if newspaper:
        try:
            # read data from tags
            title = soup.h1.text
            place = soup.select_one('span:contains("Place of publication:")').next_sibling.strip()
            dates = soup.select_one('span:contains("Publication dates:")').next_sibling.strip()
            notes = soup.select_one('span:contains("Notes:")').next_sibling.strip()
            freq = soup.select_one('span:contains("Frequency:")').next_sibling.strip()
            # print results
            print("Title of file no.", str(infile), ": ", title)
            print(place)
            print(dates)
            print(notes)
            print(freq)
        # exception handling if attributes are missing
        except AttributeError:
            print("no data")
    else:
        continue
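To cut down on the repetition of the four select_one calls, the lookups could also be collapsed into a loop over the label strings. This is only a sketch along the same lines (the LABELS list and the extract_fields name are illustrative, not part of the answer):

from bs4 import BeautifulSoup

LABELS = ["Place of publication:", "Publication dates:", "Notes:", "Frequency:"]

def extract_fields(soup):
    # map each label to the text node following its span, or None if the span is missing
    fields = {}
    for label in LABELS:
        span = soup.select_one('span:contains("%s")' % label)
        fields[label] = span.next_sibling.strip() if span and span.next_sibling else None
    return fields

Calling fields = extract_fields(soup) inside the file loop would then replace the four separate assignments in the try block.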
Upvotes: 0
Reputation: 195593
You can use the CSS selector span:contains("<YOUR STRING>") to find the specific <span> tag and then take its .next_sibling.
For example:
from bs4 import BeautifulSoup
txt = '''<h1>ITEM NAME</h1>
<span class="standardLabel">Place of publication: </span>PLACENAME
<br /><span class="standardLabel">Publication dates: </span>DATE
<br /><span class="standardLabel">Notes: </span>NOTES
<br /><span class="standardLabel">Frequency: </span>FREQUENCY'''
soup = BeautifulSoup(txt, 'html.parser')
title = soup.h1.text
place = soup.select_one('span:contains("Place of publication:")').next_sibling.strip()
dates = soup.select_one('span:contains("Publication dates:")').next_sibling.strip()
notes = soup.select_one('span:contains("Notes:")').next_sibling.strip()
freq = soup.select_one('span:contains("Frequency:")').next_sibling.strip()
print(title)
print(place)
print(dates)
print(notes)
print(freq)
Prints:
ITEM NAME
PLACENAME
DATE
NOTES
FREQUENCY
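A side note: newer versions of Soup Sieve (the CSS selector engine that BeautifulSoup uses for select_one) deprecate :contains() in favour of :-soup-contains(), so the same lookup can also be written as, for example:

from bs4 import BeautifulSoup

txt = '<span class="standardLabel">Place of publication: </span>PLACENAME'
soup = BeautifulSoup(txt, 'html.parser')
# :-soup-contains() is the non-deprecated spelling of :contains() in Soup Sieve 2.1+
place = soup.select_one('span:-soup-contains("Place of publication:")').next_sibling.strip()
print(place)  # PLACENAME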
Upvotes: 1