BDCBin

Reputation: 23

Find <a> tags between two <i> tags with beautifulsoup

I'm using Python and BeautifulSoup. I have an HTML page which looks like this:

<i>Hello<\i>
<a href="www.google.com"> Google <\a>
<i>Bye<\i>
<a href="www.google.com"> Google2 <\a>
<i>Hello<\i>
<a href="www.google.com"> Google3 <\a>
<i>Bye<\i>

I would like to get the text of all the <a> tags between the Hello and Bye tags, but not of those between the Bye and Hello tags. I know how to extract the text; I just don't know how to reach the tags. Is this possible with BeautifulSoup and Python?

Upvotes: 1

Views: 1044

Answers (3)

brennan

Reputation: 3493

You could use a mix of BeautifulSoup and regex. Here regex is used to grab everything between the limit tags, then BeautifulSoup is used to extract the anchor tags.

from bs4 import BeautifulSoup
import re

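# `html` is assumed to hold the page source from the question as a string.
# The pattern matches the literal backslash closers (<\i>) used in that markup.
excerpts = re.findall(r'<i>Hello<\\i>(.*?)<i>Bye<\\i>', html, re.DOTALL)

for e in excerpts:
    # Name the parser explicitly to avoid the "no parser specified" warning.
    soup = BeautifulSoup(e, 'html.parser')
    for link in soup.find_all('a'):
        print(link)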

Output:

<a href="www.google.com"> Google </a>
<a href="www.google.com"> Google3 </a>
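
For completeness, here is a self-contained sketch of the same regex-plus-BeautifulSoup idea that returns only the link text. It assumes the markup uses proper closing tags (</i>, </a>) rather than the backslash form in the question, and the helper name links_between is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup with proper closing tags (an assumption; the question used <\i>).
html = """\
<i>Hello</i>
<a href="www.google.com"> Google </a>
<i>Bye</i>
<a href="www.google.com"> Google2 </a>
<i>Hello</i>
<a href="www.google.com"> Google3 </a>
<i>Bye</i>
"""


def links_between(html, start="Hello", end="Bye"):
    """Return the text of every <a> found between <i>start</i> and <i>end</i>."""
    # Non-greedy group so each Hello pairs with the nearest following Bye.
    pattern = r"<i>{}</i>(.*?)<i>{}</i>".format(re.escape(start), re.escape(end))
    texts = []
    for excerpt in re.findall(pattern, html, re.DOTALL):
        soup = BeautifulSoup(excerpt, "html.parser")
        texts.extend(a.get_text(strip=True) for a in soup.find_all("a"))
    return texts


print(links_between(html))
```

Like any regex over HTML, this only holds up while the marker tags are written exactly as <i>Hello</i> and <i>Bye</i>, with no attributes or extra whitespace.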

Upvotes: 1

Bill Bell

Reputation: 21643

I corrected your HTML slightly. (Notice that the backslashes should be slashes.)

To do this, first find the 'Hello' strings. Call one of these strings s in the for-loop. Then what you want is s.findParent().findNextSibling(), that is, the first tag that follows the <i> element enclosing s.

I display s, s.findParent() and s.findParent().findNextSibling() to show you how I went about constructing what you needed from these strings.

>>> import bs4
>>> HTML = '''\
... <i>Hello</i>
... <a href="www.google.com"> Google </a>
... <i>Bye</i>
... <a href="www.google.com"> Google2 </a>
... <i>Hello</i>
... <a href="www.google.com"> Google3 </a>
... <i>Bye</i>
... '''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> for s in soup.find_all(string='Hello'):
...     s, s.findParent(), s.findParent().findNextSibling()
...     
('Hello', <i>Hello</i>, <a href="www.google.com"> Google </a>)
('Hello', <i>Hello</i>, <a href="www.google.com"> Google3 </a>)
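
If there may be more than one <a> between a Hello and its Bye, the same idea can be extended by walking the siblings until the closing marker shows up. This is only a sketch under the assumption that the markers and links are siblings, as in the sample; the function name anchors_after_hello is hypothetical.

```python
from bs4 import BeautifulSoup, Tag

HTML = """\
<i>Hello</i>
<a href="www.google.com"> Google </a>
<i>Bye</i>
<a href="www.google.com"> Google2 </a>
<i>Hello</i>
<a href="www.google.com"> Google3 </a>
<i>Bye</i>
"""


def anchors_after_hello(soup, start="Hello", end="Bye"):
    """Collect the text of <a> tags that sit between <i>start</i> and <i>end</i>."""
    results = []
    for s in soup.find_all(string=start):
        marker = s.find_parent("i")
        if marker is None:
            continue  # the string is not inside an <i> tag
        for sibling in marker.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace and other text nodes
            if sibling.name == "i" and sibling.get_text(strip=True) == end:
                break  # reached the closing marker
            if sibling.name == "a":
                results.append(sibling.get_text(strip=True))
    return results


soup = BeautifulSoup(HTML, "html.parser")
print(anchors_after_hello(soup))
```

Note that if a Hello has no matching Bye after it, this sketch collects every later sibling <a>, so add a check if that case matters to you.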

Upvotes: 3

lincr

Reputation: 1653

Perhaps you can use the re module. For reference, see the Regular Expression HOWTO for Python 2.

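import re

# Raw string so the backslashes in the markup are kept literally
# (in a normal string, "\a" would become the ASCII bell character).
str_tags = r"""
<i>Hello<\i>
<a href="www.google.com"> Google <\a>
<i>Bye<\i>
<a href="www.google.com"> Google2 <\a>
<i>Hello<\i>
<a href="www.google.com"> Google3 <\a>
<i>Bye<\i>
"""

# "<\\a>" matches a literal backslash followed by "a".
str_re = re.compile(r".*Hello.*\s<a[^>]*>([\w\s]+)<\\a>\s<i>Bye")
content_lst = str_re.findall(str_tags)
if content_lst:
    print(content_lst)
else:
    print("Not found")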

Output

[' Google ', ' Google3 ']

Note that this method depends strongly on what your HTML looks like. For an explanation of the code above, see the Regular Expression HOWTO mentioned at the start of this answer.

Upvotes: 0
