Reputation: 51
My first time posting. I am using BeautifulSoup 4 and Python 2.7 (PyCharm). I have a web page containing definition lists, and I need to extract the <dd> values whose <dt> labels are either 'Salary:' or 'Date:'. The page contains multiple lists.
The problem: I cannot seem to identify and extract specific text. I have searched this site and tried without success.
Example html:
<dl><dt>Date:</dt><dd>13 September 2015</dd><dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl><dl><dt>Date:</dt><dd>15 December 2015</dd><dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl><dl><dt>Date:</dt><dd>10 January 2014</dd><dt>Salary:</dt><dd>Starting at £18,160 per annum.</dd></dl>
Code which I have tried without success:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.mywebsite.com/test.html")
soup = BeautifulSoup(r.content, "html.parser")
dl_data = soup.find_all("dl")
for dlitem in dl_data:
    print dlitem.find("dt",text="Date:").parent.findNext("dd").contents[0]
    print dlitem.find("dt",text="Salary:").parent.findNext("dd").contents[0]
Expected Result:
13 September 2015
15 December 2015
10 January 2014
Starting at £40,130 per annum.
Starting at £22,460 per annum.
Starting at £18,160 per annum.
Actual Result:
print dlitem.find("dt",text="Date:").parent.findNext("dd").contents[0]
AttributeError: 'NoneType' object has no attribute 'parent'
I have tried numerous variations of this code and gone round in circles. I figured out how to print all dd elements to the screen, just not specific dd elements!
Thanks
Upvotes: 4
Views: 12183
Reputation: 871
A more robust solution would be to make a dict of (key, value) pairs of all (dt, dd) elements in the dl, then select the desired fields from the dict.
Data in some class "obj":
html = """
<dl class="obj">
<dt>Time</dt> <dd>10:00</dd>
<dt>Temp</dt> <dd>20.5°C</dd>
</dl>
"""
Save all the "dt" and "dd" elements, then zip them to form a dict:
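from bs4 import BeautifulSoup

def get_dl(soup):
    keys, values = [], []
    for dl in soup.findAll("dl", {"class": "obj"}):
        # Collect the labels and the values in document order
        for dt in dl.findAll("dt"):
            keys.append(dt.text.strip())
        for dd in dl.findAll("dd"):
            values.append(dd.text.strip())
    # Note: with several matching <dl> blocks, repeated labels overwrite earlier ones
    return dict(zip(keys, values))

soup = BeautifulSoup(html, features="html.parser")
dl_dict = get_dl(soup)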
Outputs:
{'Time': '10:00', 'Temp': '20.5°C'}
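Since the page in the question contains several lists with the same labels, a single dict would keep only the last "Date:" and "Salary:". A minimal sketch of one way around that, assuming each <dd> directly follows its <dt> as a sibling: build one dict per <dl>. The name dl_to_dict is hypothetical and the HTML is a shortened copy of the question's example.

```python
from bs4 import BeautifulSoup

HTML = """
<dl><dt>Date:</dt><dd>13 September 2015</dd><dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>
<dl><dt>Date:</dt><dd>15 December 2015</dd><dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl>
"""

def dl_to_dict(dl):
    """Map each <dt> label in one <dl> to the text of the <dd> sibling after it."""
    record = {}
    for dt in dl.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:  # skip labels that have no following value
            record[dt.get_text(strip=True)] = dd.get_text(strip=True)
    return record

soup = BeautifulSoup(HTML, "html.parser")
# One dict per list, so repeated labels do not overwrite each other
records = [dl_to_dict(dl) for dl in soup.find_all("dl")]
for record in records:
    print(record.get("Date:"), "|", record.get("Salary:"))
```

Using record.get() rather than record[...] means a list that lacks one of the labels yields None instead of raising a KeyError.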
Upvotes: 3
Reputation: 57
I guess it works if you just omit the .parent in your code. At least this worked for my problem, which is very similar to yours.
Here's my HTML, where the order of the <dt> elements is not guaranteed:
<dl>
<dt>Time</dt><dd>10:05:02</dd>
<dt>Temp</dt><dd>20.5°C</dd>
</dl>
I'm accessing the values successfully with the following code:
time = at_tl.find("dt",text="Time").findNext("dd").string
temp = at_tl.find("dt",text="Temp").findNext("dd").string
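Two things seem worth spelling out here. First, the parent of a <dt> is the <dl> itself, so searching forward from it returns the first <dd> in the list regardless of the label. Second, find() returns None when nothing matches, which is what produces the AttributeError in the question. Why nothing matched is not shown. One possible cause (an assumption) is extra whitespace or nested markup inside the <dt>, since an exact-text match would then fail. The sketch below compares stripped text instead and guards against a missing label. value_for is a hypothetical helper, and the stray spaces in the sample HTML are added on purpose to illustrate the point.

```python
from bs4 import BeautifulSoup

def value_for(dl, label):
    """Return the text of the <dd> after the <dt> whose stripped text equals label, or None."""
    for dt in dl.find_all("dt"):
        if dt.get_text(strip=True) == label:
            dd = dt.find_next_sibling("dd")
            return dd.get_text(strip=True) if dd is not None else None
    return None  # label not present in this list

# Assumed variation of the question's markup: whitespace around the first label
html = ("<dl><dt> Date: </dt><dd>13 September 2015</dd>"
        "<dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>")
soup = BeautifulSoup(html, "html.parser")
dl = soup.find("dl")

date = value_for(dl, "Date:")
salary = value_for(dl, "Salary:")
missing = value_for(dl, "Location:")  # absent label yields None, not an exception
print(date, salary, missing)
```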
Upvotes: 2
Reputation: 926
If order is not important, just make some changes:
...
dl_data = soup.find_all("dd")
for dlitem in dl_data:
    print dlitem.string
Result:
13 September 2015
Starting at £40,130 per annum.
15 December 2015
Starting at £22,460 per annum.
10 January 2014
Starting at £18,160 per annum.
For your latest request:
for item in list(zip(soup.find_all("dd")[0::3],soup.find_all("dd")[2::3])):
    date, salary = item
    print ', '.join([date.string, salary.string])
Output:
13 September 2015, 100
14 September 2015, 200
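Positional slices such as [0::3] depend on every list having the same number of <dd> elements in the same order. As an alternative sketch, assuming each <dd> directly follows its <dt>, the values can be grouped by label instead. That also gives the order listed under "Expected Result" in the question (all dates, then all salaries). The name by_label is illustrative and the HTML is the question's example.

```python
from collections import defaultdict
from bs4 import BeautifulSoup

HTML = (
    "<dl><dt>Date:</dt><dd>13 September 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>15 December 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>10 January 2014</dd>"
    "<dt>Salary:</dt><dd>Starting at £18,160 per annum.</dd></dl>"
)

soup = BeautifulSoup(HTML, "html.parser")

# Group every <dd> value under the text of the <dt> that precedes it
by_label = defaultdict(list)
for dt in soup.find_all("dt"):
    dd = dt.find_next_sibling("dd")
    if dd is not None:
        by_label[dt.get_text(strip=True)].append(dd.get_text(strip=True))

# Print all dates first, then all salaries
for label in ("Date:", "Salary:"):
    for value in by_label[label]:
        print(value)
```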
Upvotes: 7