topcat

Reputation: 51

Using BeautifulSoup to extract specific dl and dd list elements

This is my first time posting. I am using BeautifulSoup 4 and Python 2.7 (PyCharm). I have a webpage containing <dl> elements, and I need to extract the <dd> values whose <dt> labels are either 'Salary:' or 'Date:'. The page contains multiple <dl> lists.

The problem: I cannot identify and extract the specific text. I have searched this site and tried several approaches without success.

Example HTML:

<dl><dt>Date:</dt><dd>13 September 2015</dd><dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>
<dl><dt>Date:</dt><dd>15 December 2015</dd><dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl>
<dl><dt>Date:</dt><dd>10 January 2014</dd><dt>Salary:</dt><dd>Starting at £18,160 per annum.</dd></dl>

Code which I have tried without success:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.mywebsite.com/test.html")
soup = BeautifulSoup(r.content, "html.parser")
dl_data = soup.find_all("dl")
for dlitem in dl_data: 
    print dlitem.find("dt",text="Date:").parent.findNext("dd").contents[0]
    print dlitem.find("dt",text="Salary:").parent.findNext("dd").contents[0]

Expected Result:

13 September 2015
15 December 2015
10 January 2014
Starting at £40,130 per annum.
Starting at £22,460 per annum.
Starting at £18,160 per annum.

Actual Result:

print dlitem.find("dt",text="Date:").parent.findNext("dd").contents[0]
AttributeError: 'NoneType' object has no attribute 'parent'

I have tried numerous variations of this code and gone round in circles. I figured out how to print all the dd elements to the screen, just not specific dd elements!

Thanks

Upvotes: 4

Views: 12183

Answers (3)

Leo103

Reputation: 871

A more robust solution would be to make a dict of (key,value) pairs of all (dt,dd) elements in the dl. Then select the desired fields from the dict.


How to convert a 'dl' element to a dict

Sample data in a "dl" with class "obj":

html = """
    <dl class="obj">
      <dt>Time</dt> <dd>10:00</dd>
      <dt>Temp</dt> <dd>20.5°C</dd>
    </dl>  
       """

Collect the text of all the "dt" and "dd" elements, then zip them to form a dict:

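from bs4 import BeautifulSoup

def get_dl(soup):
    # Collect labels and values from every <dl class="obj"> in document order.
    keys, values = [], []
    for dl in soup.find_all("dl", {"class": "obj"}):
        for dt in dl.find_all("dt"):
            keys.append(dt.text.strip())
        for dd in dl.find_all("dd"):
            values.append(dd.text.strip())
    # Pair the i-th <dt> with the i-th <dd>; a repeated label keeps its last value.
    return dict(zip(keys, values))

soup = BeautifulSoup(html, features="html.parser")
dl_dict = get_dl(soup)
print(dl_dict)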

Outputs:

{'Time': '10:00', 'Temp': '20.5°C'}
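
When a page holds several lists with the same labels, as in the question, a single dict would keep only the last "Date:" and "Salary:". A minimal sketch of one way around that, building one dict per dl, is below. The helper name dl_to_dict and the shortened sample markup are illustrative assumptions, and the pairing relies on each dt being directly followed by its own dd sibling.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (shortened to two lists).
HTML = (
    "<dl><dt>Date:</dt><dd>13 September 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>15 December 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl>"
)


def dl_to_dict(dl):
    """Pair each <dt> in one <dl> with the next <dd> sibling (hypothetical helper)."""
    record = {}
    for dt in dl.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:  # skip a label with no <dd> after it
            record[dt.get_text(strip=True)] = dd.get_text(strip=True)
    return record


soup = BeautifulSoup(HTML, "html.parser")
# One dict per <dl>, so repeated labels no longer overwrite each other.
records = [dl_to_dict(dl) for dl in soup.find_all("dl")]

for record in records:
    print(record["Date:"], "|", record["Salary:"])
```

Each entry in records can then be queried by label, which avoids depending on the position of a dd within its list.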

Upvotes: 3

Doc

Reputation: 57

It should work if you simply omit the .parent from your code. At least, that worked for my problem, which is very similar to yours.

Here's my HTML, where the order of the <dt> elements is not guaranteed:

<dl>
 <dt>Time</dt><dd>10:05:02</dd>
 <dt>Temp</dt><dd>20.5°C</dd>
</dl>

I'm accessing the values successfully with the following code:

# at_tl is presumably the parsed <dl> Tag (or the whole soup) for the HTML above
time = at_tl.find("dt", text="Time").findNext("dd").string
temp = at_tl.find("dt", text="Temp").findNext("dd").string
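
If find returns None, as in the question's AttributeError, chaining another call onto it fails. The sketch below is one hedged way to guard against that: it compares the stripped label text instead of relying on an exact match, and returns None when the label is missing. The helper name value_for and the extra whitespace around "Time" are assumptions made for illustration, not something confirmed about the asker's page.

```python
from bs4 import BeautifulSoup

# Markup from this answer, with stray whitespace added around one label to
# illustrate a case where an exact text match may not find the <dt>
# (an assumption about real pages, not a confirmed cause).
HTML = """
<dl>
 <dt> Time </dt><dd>10:05:02</dd>
 <dt>Temp</dt><dd>20.5°C</dd>
</dl>
"""


def value_for(dl, label):
    """Return the text of the <dd> after the <dt> named `label`, or None."""
    for dt in dl.find_all("dt"):
        if dt.get_text(strip=True) == label:
            dd = dt.find_next("dd")
            return dd.get_text(strip=True) if dd is not None else None
    return None  # label not present, so the caller gets None, not an exception


soup = BeautifulSoup(HTML, "html.parser")
dl = soup.find("dl")

print(value_for(dl, "Time"))    # 10:05:02
print(value_for(dl, "Temp"))    # 20.5°C
print(value_for(dl, "Salary"))  # None
```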

Upvotes: 2

mmachine

Reputation: 926

If order is not important, just make some changes:

...
dl_data = soup.find_all("dd")
for dlitem in dl_data:
    print dlitem.string

Result:

13 September 2015
Starting at £40,130 per annum.
15 December 2015
Starting at £22,460 per annum.
10 January 2014
Starting at £18,160 per annum.
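
The expected result in the question lists all dates first and then all salaries. A minimal sketch of that ordering, using the same kind of slicing, is below. It assumes every dl holds exactly two dd tags in the order date, then salary, as in the question's sample; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = (
    "<dl><dt>Date:</dt><dd>13 September 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>15 December 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>10 January 2014</dd>"
    "<dt>Salary:</dt><dd>Starting at £18,160 per annum.</dd></dl>"
)

soup = BeautifulSoup(HTML, "html.parser")

# Assumption: two <dd> tags per <dl> (date, then salary), so even positions
# are dates and odd positions are salaries.
dd_texts = [dd.get_text(strip=True) for dd in soup.find_all("dd")]
dates = dd_texts[0::2]
salaries = dd_texts[1::2]

# All dates first, then all salaries, as in the question's expected result.
for line in dates + salaries:
    print(line)
```

Positional slicing breaks as soon as a list gains or loses a dd, so it only suits pages whose layout is known to be uniform.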

For your latest request:

for item in list(zip(soup.find_all("dd")[0::3],soup.find_all("dd")[2::3])):
    date, salary = item
    print ', '.join([date.string, salary.string])

Output:

13 September 2015, 100
14 September 2015, 200

Upvotes: 7
