Reputation: 601
Here is the HTML I am trying to extract from:
<dl class="journal-meta--list">
<dt>Managing editors(s)</dt>
<dd>
::before
“John Doe”
::after
</dd>
<dd>
::before
“Jane Doe”
::after
<dd>
<dt>Date</dt>
<dd>
::before
“Jan 2017”
::after
</dd>
<dd>
::before
“Feb 2017”
::after
<dd>
I am trying to find and print the text in the tags by searching for the contents of the tags. So I want to search for <dt>Managing editors(s)</dt> and get back an array where array[0] = "John Doe" and array[1] = "Jane Doe". I do not want ALL the dd's, just the two after the dt.
I can do this:
editorsList = soup.find("dl", class_="journal-meta--list").getText()
and I get all the text, including the dt, but I am trying to parse it by the dt and just get the text of the dd's until the next dt.
I already have BeautifulSoup loaded and working; I just don't know how to search for these lists. Thanks!
Upvotes: 0
Views: 2507
Reputation: 2233
You can use the following code to get the expected result:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
html_string = """<dl class="journal-meta--list">
<dt>Managing editors(s)</dt>
<dd>
::before
“John Doe”
::after
</dd>
<dd>
::before
“Jane Doe”
::after
<dd>
<dt>Date</dt>
<dd>
::before
“Jan 2017”
::after
</dd>
<dd>
::before
“Feb 2017”
::after
<dd>"""
soup = BeautifulSoup(html_string, "lxml")
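def string_search(text):
    # Find the <dt> whose text matches exactly, then take its first two <dd> siblings.
    dt = soup.find('dt', string=text)
    results = []
    for dd in dt.find_next_siblings('dd')[0:2]:
        # Drop the ::before / ::after markers and replace non-ASCII characters (the curly quotes).
        raw = dd.get_text().replace("::before", "").replace("::after", "")
        results.append(''.join(c if ord(c) < 128 else ' ' for c in raw).strip())
    return results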
print(string_search('Managing editors(s)'))
print(string_search('Date'))
The result will be:
[u'John Doe', u'Jane Doe']
[u'Jan 2017', u'Feb 2017']
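Note that the [0:2] slice hard-codes how many dd's belong to each dt. A minimal sketch of an alternative that stops at the next dt instead is below. It assumes the real page source is well formed and does not contain the ::before / ::after markers (browser dev tools display those for CSS pseudo-elements). The helper name dd_texts_after and the simplified HTML are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed version of the markup (an assumption for this sketch).
HTML = """<dl class="journal-meta--list">
<dt>Managing editors(s)</dt>
<dd>John Doe</dd>
<dd>Jane Doe</dd>
<dt>Date</dt>
<dd>Jan 2017</dd>
<dd>Feb 2017</dd>
</dl>"""


def dd_texts_after(soup, label):
    """Return the text of each <dd> that follows the <dt> named `label`,
    stopping as soon as the next <dt> is reached."""
    dt = soup.find("dt", string=label)
    if dt is None:
        return []
    texts = []
    # Walk the following <dt>/<dd> siblings in document order.
    for sibling in dt.find_next_siblings(["dt", "dd"]):
        if sibling.name == "dt":
            break  # the next term starts here, so this group is done
        texts.append(sibling.get_text(strip=True))
    return texts


soup = BeautifulSoup(HTML, "html.parser")
print(dd_texts_after(soup, "Managing editors(s)"))
print(dd_texts_after(soup, "Date"))
```

This way the function keeps working if a term has one editor or five, at the cost of a short loop instead of a one-liner.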
Upvotes: 0
Reputation: 12168
You can locate the dt using the string filter, then find all of its following dd siblings.
In [4]: soup.find('dt', string='Managing editors(s)').find_next_siblings('dd')
Out[4]:
[<dd>
::before
“John Doe”
::after
</dd>, <dd>
::before
“Jane Doe”
::after
<dd>
</dd></dd>, <dd>
::before
“Jan 2017”
::after
</dd>, <dd>
::before
“Feb 2017”
::after
<dd></dd></dd>]
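The output above also contains the dd's that belong to Date, because find_next_siblings does not stop at the next dt. One possible way to keep the groups apart is to walk the direct children of the dl once and collect each dd under the most recent dt. This is only a sketch: the function name group_dl and the cleaned-up HTML are assumptions, and it relies on the dt and dd tags being direct children of a well-formed dl.

```python
from bs4 import BeautifulSoup

# Cleaned-up, well-formed markup (an assumption for this sketch).
HTML = """<dl class="journal-meta--list">
<dt>Managing editors(s)</dt>
<dd>John Doe</dd>
<dd>Jane Doe</dd>
<dt>Date</dt>
<dd>Jan 2017</dd>
<dd>Feb 2017</dd>
</dl>"""


def group_dl(dl):
    """Map each <dt> text to the list of <dd> texts that follow it."""
    groups = {}
    current = None
    # recursive=False limits the search to direct children of the <dl>.
    for child in dl.find_all(["dt", "dd"], recursive=False):
        if child.name == "dt":
            current = child.get_text(strip=True)
            groups[current] = []
        elif current is not None:
            groups[current].append(child.get_text(strip=True))
    return groups


soup = BeautifulSoup(HTML, "html.parser")
groups = group_dl(soup.find("dl", class_="journal-meta--list"))
print(groups["Managing editors(s)"])
print(groups["Date"])
```

With the groups in a dict, groups["Managing editors(s)"][0] and [1] give the two names the question asks for.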
Upvotes: 1