miltonjbradley

Reputation: 601

Python and BeautifulSoup, find and print dd list items by finding the dt text

Here is the HTML I am trying to extract from:

<dl class="journal-meta--list">
<dt>Managing editors(s)</dt>
<dd>
    ::before
    “John Doe”
    ::after
</dd>
<dd>
    ::before
    “Jane Doe”
    ::after
<dd>
<dt>Date</dt>
<dd>
    ::before
    “Jan 2017”
    ::after
</dd>
<dd>
    ::before
    “Feb 2017”
    ::after
<dd>

I am trying to find and print the text in the dd tags by searching for the contents of the dt tags. So I want to search for <dt>Managing editors(s)</dt> and get back an array where array[0] = "John Doe" and array[1] = "Jane Doe". I do not want ALL the dd's, just the two after the dt.

I can do this:

 editorsList = soup.find("dl", class_="journal-meta--list").getText()

and I get all the text including the dt, but I am trying to parse it by the dt and just get the text of the dd's until the next dt.

I already have BeautifulSoup loaded and working; I just don't know how to search for these lists. Thanks!

Upvotes: 0

Views: 2507

Answers (2)

Satish Prakash Garg

Reputation: 2233

You can use the following code to get the expected result:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
html_string = """<dl class="journal-meta--list">
<dt>Managing editors(s)</dt>
<dd>
    ::before
    “John Doe”
    ::after
</dd>
<dd>
    ::before
    “Jane Doe”
    ::after
<dd>
<dt>Date</dt>
<dd>
    ::before
    “Jan 2017”
    ::after
</dd>
<dd>
    ::before
    “Feb 2017”
    ::after
<dd>"""
soup = BeautifulSoup(html_string, "lxml")


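def string_search(text):
    # Find the <dt> whose string matches exactly, then collect the <dd> siblings after it.
    dd_tags = soup.find('dt', string=text).find_next_siblings('dd')
    results = []
    # The slice keeps only the first two <dd> tags; it is hard-coded for this example.
    for dd in dd_tags[0:2]:
        raw = dd.get_text().replace("::before", "").replace("::after", "")
        # Replace non-ASCII characters (the curly quotes) with spaces, then trim.
        results.append(''.join(c if ord(c) < 128 else ' ' for c in raw).strip())
    return results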

print(string_search('Managing editors(s)'))
print(string_search('Date'))

The result will be:

[u'John Doe', u'Jane Doe']
[u'Jan 2017', u'Feb 2017']
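
The ASCII filter above would also blank out accented letters in a name. As an alternative, here is a minimal sketch that only trims the ::before/::after markers and any surrounding quotes. It assumes Python 3, and clean_dd_text is a made-up helper name, not part of BeautifulSoup:

```python
def clean_dd_text(raw):
    """Strip the ::before/::after markers, whitespace, and surrounding quotes."""
    for marker in ("::before", "::after"):
        raw = raw.replace(marker, "")
    # Trim whitespace first, then straight or curly double quotes at either end.
    return raw.strip().strip('"\u201c\u201d').strip()


# Sample input shaped like dd.get_text() for the first <dd> in the question.
sample = "\n    ::before\n    \u201cJohn Doe\u201d\n    ::after\n"
print(clean_dd_text(sample))
```

You would call it on dd.get_text() for each dd you keep. Characters inside the name are left untouched, so non-ASCII names survive.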

Upvotes: 0

宏杰李

Reputation: 12168

You can locate the dt using a string filter, then find all of the dd siblings that follow it.

In [4]: soup.find('dt', string='Managing editors(s)').find_next_siblings('dd')
Out[4]: 
[<dd>
     ::before
     “John Doe”
     ::after
 </dd>, <dd>
     ::before
     “Jane Doe”
     ::after
 <dd>
 </dd></dd>, <dd>
     ::before
     “Jan 2017”
     ::after
 </dd>, <dd>
     ::before
     “Feb 2017”
     ::after
 <dd></dd></dd>]
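
Note that the result above includes every later dd, not just the ones that belong to the matched dt. To stop at the next dt, one option is to walk the following siblings and break when a dt appears. This is only a sketch: it assumes well-formed markup (each dd properly closed, unlike the snippet in the question), uses the built-in html.parser, and dd_texts_after is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Assumption: simplified, well-formed version of the markup from the question.
html = """<dl class="journal-meta--list">
<dt>Managing editors(s)</dt>
<dd>John Doe</dd>
<dd>Jane Doe</dd>
<dt>Date</dt>
<dd>Jan 2017</dd>
<dd>Feb 2017</dd>
</dl>"""

soup = BeautifulSoup(html, "html.parser")


def dd_texts_after(soup, dt_text):
    """Return the text of each <dd> after the matching <dt>, up to the next <dt>."""
    dt = soup.find("dt", string=dt_text)
    if dt is None:
        return []
    texts = []
    # find_next_siblings(True) yields only tags, skipping whitespace text nodes.
    for sibling in dt.find_next_siblings(True):
        if sibling.name == "dt":
            break  # reached the next term, so this group is finished
        if sibling.name == "dd":
            texts.append(sibling.get_text(strip=True))
    return texts


print(dd_texts_after(soup, "Managing editors(s)"))
print(dd_texts_after(soup, "Date"))
```

With the sample markup this gives the two editors for the first call and the two dates for the second. On the unclosed dd tags from the question, the grouping depends on how the chosen parser repairs the markup.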

Upvotes: 1
