GhostKU

Reputation: 2108

How can I find one tag between two other tags?

I have a document with a structure like this:

<tag1>some_text_1</tag1>
<tag2>text_1</tag2>
<tag3>....</tag3>
<tag2>text_2</tag2>
<tag1>some_text_2</tag1>
<tag2>text_3</tag2>
...

And I need to get all tag2 instances that are after tag1 with some_text_1 and before the next tag1.

Upvotes: 0

Views: 151

Answers (2)

Sandeep

Reputation: 155

from bs4 import BeautifulSoup 

html = '''<tag1>some_text_1</tag1>
        <tag2>text_1</tag2>
    <tag3>....</tag3>
    <tag2>text_2</tag2>
    <tag1>some_text_2</tag1>
    <tag2>text_3</tag2>'''

soup = BeautifulSoup(html,"html.parser")

def findalltags(tag1, tag2, soup):
    # tag1: the delimiting tag; tag2: the tag to collect between two tag1 elements
    a = soup.find(tag1)
    lis = []
    while a is not None:
        a = a.find_next()
        # stop at the end of the document or at the next tag1
        if a is None or a.name == tag1:
            break
        if a.name == tag2:
            lis.append(a)
    return lis

if __name__ == '__main__':
    print(findalltags('tag1', 'tag2', soup))

Hope this will solve the problem, but I don't think this is an efficient way. You can use regular expressions if you are familiar with them.

Upvotes: 0

Padraic Cunningham

Reputation: 180401

Your description, "I need to get all tag2 instances that are after tag1 with some_text_1 and before the next tag2", basically equates to getting the first tag2 after any tag1 with the text some_text_.

So find the tag1 elements with the given text and check whether the next sibling tag is a tag2; if it is, pull the tag2:

from bs4 import BeautifulSoup

html = """<tag1>some_text_1</tag1>
<tag2>text_1</tag2>
<tag3>....</tag3>
<tag2>text_2</tag2>
<tag1>some_text_2</tag1>
<tag2>text_3</tag2>"""


def get_tags_if_preceded_by(soup, tag1, tag2, text):
    for t1 in soup.find_all(tag1, text=text):
        nxt_sib = t1.find_next_sibling()
        if nxt_sib and nxt_sib.name == tag2:
            yield nxt_sib

soup = BeautifulSoup(html, "lxml")

print(list(get_tags_if_preceded_by(soup, "tag1", "tag2", "some_text_1")))

If it does not have to be directly after, that actually makes it simpler: you just need to search for the next tag2 sibling:

def get_tags_if_preceded_by(soup, tag1, tag2, text):
    for t1 in soup.find_all(tag1, text=text):
        # pass tag2 (the parameter) to restrict the search to that tag name
        nxt_sib = t1.find_next_sibling(tag2)
        if nxt_sib:
            yield nxt_sib

If you really want to find tags between two tags specifically, you can use the logic in this answer.
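
As a rough sketch of that "between two tags" idea (not necessarily the approach in the linked answer), you could walk the siblings after the matching tag1 and stop at the next tag1. The function name tags_between is made up for illustration, and this assumes all the tags are siblings at the same level, as in the sample, and a bs4 version that accepts the string= argument:

```python
from bs4 import BeautifulSoup

html = """<tag1>some_text_1</tag1>
<tag2>text_1</tag2>
<tag3>....</tag3>
<tag2>text_2</tag2>
<tag1>some_text_2</tag1>
<tag2>text_3</tag2>"""


def tags_between(soup, tag1, tag2, text):
    """Yield each tag2 sibling after the tag1 containing `text`, up to the next tag1.

    Hypothetical helper: assumes tag1 and tag2 are siblings at the same level.
    """
    start = soup.find(tag1, string=text)
    if start is None:
        return
    for sib in start.find_next_siblings():
        if sib.name == tag1:
            # reached the next tag1, so the section is over
            break
        if sib.name == tag2:
            yield sib


soup = BeautifulSoup(html, "html.parser")
result = [t.get_text() for t in tags_between(soup, "tag1", "tag2", "some_text_1")]
print(result)
```

For the sample document this should collect text_1 and text_2 and leave out text_3, since the loop ends at the second tag1. If the tags can be nested at different depths, sibling traversal will not be enough and something like find_all_next would be needed instead.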

Upvotes: 1
