Reputation: 23
I'm using Python and BeautifulSoup. I have an HTML page which looks like this:
<i>Hello<\i>
<a href="www.google.com"> Google <\a>
<i>Bye<\i>
<a href="www.google.com"> Google2 <\a>
<i>Hello<\i>
<a href="www.google.com"> Google3 <\a>
<i>Bye<\i>
I would like to get the text of all the "a" tags that lie between the Hello and Bye tags, but not those between the Bye and Hello tags. (I know how to extract the text; I just don't know how to reach the tags.) Is this possible with Beautiful Soup and Python?
Upvotes: 1
Views: 1044
Reputation: 3493
You could use a mix of BeautifulSoup and regex. Here regex is used to grab everything between the limit tags, then BeautifulSoup is used to extract the anchor tags.
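from bs4 import BeautifulSoup
import re

# `html` is assumed to hold the page source as a string.
# The pattern matches the backslash form posted in the question (<\i>);
# for well-formed markup, use </i> in the pattern instead.
excerpts = re.findall(r'<i>Hello<\\i>(.*?)<i>Bye<\\i>', html, re.DOTALL)
for e in excerpts:
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(e, 'html.parser')
    for link in soup.find_all('a'):
        print(link)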
Output:
<a href="www.google.com"> Google </a>
<a href="www.google.com"> Google3 </a>
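Since the question asks for the link text rather than the tags, here is a minimal self-contained sketch of the same regex-plus-BeautifulSoup idea. It assumes well-formed closing tags (`</i>`, `</a>`); the function name `link_texts_between` and its `start`/`end` parameters are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """\
<i>Hello</i>
<a href="www.google.com"> Google </a>
<i>Bye</i>
<a href="www.google.com"> Google2 </a>
<i>Hello</i>
<a href="www.google.com"> Google3 </a>
<i>Bye</i>
"""

def link_texts_between(html, start="Hello", end="Bye"):
    """Return the text of every <a> found between <i>start</i> and <i>end</i>."""
    # Non-greedy match so each Hello is paired with the nearest following Bye.
    pattern = re.compile(
        r"<i>{}</i>(.*?)<i>{}</i>".format(re.escape(start), re.escape(end)),
        re.DOTALL,
    )
    texts = []
    for excerpt in pattern.findall(html):
        soup = BeautifulSoup(excerpt, "html.parser")
        texts.extend(a.get_text(strip=True) for a in soup.find_all("a"))
    return texts

print(link_texts_between(html))
```

The regex only does the coarse slicing here; extracting the anchors is left to the parser, so attributes or extra whitespace inside the `a` tags should not matter.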
Upvotes: 1
Reputation: 21643
I corrected your HTML slightly. (Notice that the backslashes should be slashes.)
To do this, first find the 'Hello' strings. Call one of these strings s in the for-loop. Then what you want is s.findParent().findNextSibling().
I display s, s.findParent() and s.findParent().findNextSibling() to show you how I went about constructing what you needed from these strings.
>>> import bs4
>>> HTML = '''\
... <i>Hello</i>
... <a href="www.google.com"> Google </a>
... <i>Bye</i>
... <a href="www.google.com"> Google2 </a>
... <i>Hello</i>
... <a href="www.google.com"> Google3 </a>
... <i>Bye</i>
... '''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> for s in soup.find_all(string='Hello'):
...     s, s.findParent(), s.findParent().findNextSibling()
...
('Hello', <i>Hello</i>, <a href="www.google.com"> Google </a>)
('Hello', <i>Hello</i>, <a href="www.google.com"> Google3 </a>)
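If there can be more than one link between a Hello and the following Bye, taking only the next sibling will miss some of them. A rough sketch of one way to extend the idea is to walk the siblings after each Hello until the Bye marker turns up. The helper name `links_between` is my own, and this assumes the `i` and `a` tags are siblings, as in the sample.

```python
from bs4 import BeautifulSoup

HTML = """\
<i>Hello</i>
<a href="www.google.com"> Google </a>
<i>Bye</i>
<a href="www.google.com"> Google2 </a>
<i>Hello</i>
<a href="www.google.com"> Google3 </a>
<i>Bye</i>
"""

def links_between(soup, start="Hello", end="Bye"):
    """Collect the text of <a> tags located between <i>start</i> and <i>end</i>."""
    texts = []
    for marker in soup.find_all("i", string=start):
        # True matches any tag, so text nodes (the newlines) are skipped.
        for sib in marker.find_next_siblings(True):
            if sib.name == "i" and sib.get_text(strip=True) == end:
                break  # reached the closing marker for this block
            if sib.name == "a":
                texts.append(sib.get_text(strip=True))
    return texts

soup = BeautifulSoup(HTML, "html.parser")
print(links_between(soup))
```

I used 'html.parser' here only so the sketch has no dependency beyond bs4; 'lxml' should behave the same way for this input.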
Upvotes: 3
Reputation: 1653
Perhaps you can use the re module. For reference, see the Regular Expression HOWTO for Python 2.
import re

# Raw string: keeps the backslashes literal (in a normal string "\a" is the BEL character).
str_tags = r"""
<i>Hello<\i>
<a href="www.google.com"> Google <\a>
<i>Bye<\i>
<a href="www.google.com"> Google2 <\a>
<i>Hello<\i>
<a href="www.google.com"> Google3 <\a>
<i>Bye<\i>
"""

# "<\\a>" matches a literal backslash followed by "a", as written in the question.
str_re = re.compile(r".*Hello.*\s<a[^>]*>([\w\s]+)<\\a>\s<i>Bye")
content_lst = str_re.findall(str_tags)
if content_lst:
    print(content_lst)
else:
    print("Not found")
Output
[' Google ', ' Google3 ']
Note that this method depends strongly on what your HTML looks like. For an explanation of the code above, please also refer to the HOWTO linked above.
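If that dependence on the exact layout is a concern, a different technique (not the regex approach above) is to let a parser tokenize the page and keep a flag that is switched on at Hello and off at Bye. Below is a rough sketch using only the standard library's html.parser; the class name `LinkCollector` and its attributes are made up for illustration, and it assumes well-formed closing tags and no nested tags inside the `i` or `a` elements.

```python
from html.parser import HTMLParser

HTML = """\
<i>Hello</i>
<a href="www.google.com"> Google </a>
<i>Bye</i>
<a href="www.google.com"> Google2 </a>
<i>Hello</i>
<a href="www.google.com"> Google3 </a>
<i>Bye</i>
"""

class LinkCollector(HTMLParser):
    """Collect <a> text seen after <i>start</i> and before <i>end</i>."""

    def __init__(self, start="Hello", end="Bye"):
        super().__init__()
        self.start = start
        self.end = end
        self.current = None   # tag we are currently inside, if any
        self.inside = False   # True between a start marker and an end marker
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags
        if self.current == "i":
            if text == self.start:
                self.inside = True
            elif text == self.end:
                self.inside = False
        elif self.current == "a" and self.inside:
            self.texts.append(text)

collector = LinkCollector()
collector.feed(HTML)
print(collector.texts)
```

Because the flag only reacts to the marker text, line breaks and attribute order in the page should not affect the result.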
Upvotes: 0