Reputation: 2108
I have a document with structure like this:
<tag1>some_text_1</tag1>
<tag2>text_1</tag2>
<tag3>....</tag3>
<tag2>text_2</tag2>
<tag1>some_text_2</tag1>
<tag2>text_3</tag2>
...
And I need to get all tag2 instances that come after the tag1 with some_text_1 and before the next tag1.
Upvotes: 0
Views: 151
Reputation: 155
from bs4 import BeautifulSoup

html = '''<tag1>some_text_1</tag1>
<tag2>text_1</tag2>
<tag3>....</tag3>
<tag2>text_2</tag2>
<tag1>some_text_2</tag1>
<tag2>text_3</tag2>'''
soup = BeautifulSoup(html, "html.parser")

def findalltags(tag1, tag2, soup):
    # tag1: the boundary tag; collection starts after its first occurrence
    # tag2: the tag to collect until the next tag1 (or the end of the document)
    a = soup.find(tag1)
    lis = []
    while a is not None:
        a = a.find_next()
        # find_next() returns None once the document is exhausted
        if a is None or a.name == tag1:
            break
        if a.name == tag2:
            lis.append(a)
    return lis

if __name__ == '__main__':
    print(findalltags('tag1', 'tag2', soup))
I hope this solves the problem, although I don't think it is an efficient approach. You can also use regular expressions if you are familiar with them.
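For the regular-expression route mentioned above, a minimal sketch might look like the following. It assumes the markup is exactly as flat and regular as the sample (no attributes, no nesting), and the helper name tag2_texts_after is made up for illustration. Regexes tend to be fragile on real-world HTML, so treat this as a last resort.

```python
import re

html = '''<tag1>some_text_1</tag1>
<tag2>text_1</tag2>
<tag3>....</tag3>
<tag2>text_2</tag2>
<tag1>some_text_2</tag1>
<tag2>text_3</tag2>'''

def tag2_texts_after(html, start_text):
    # Hypothetical helper: grab everything between the matching <tag1>
    # and the next <tag1> (or the end of the input).
    section = re.search(
        r"<tag1>" + re.escape(start_text) + r"</tag1>(.*?)(?=<tag1>|\Z)",
        html,
        flags=re.DOTALL,
    )
    if section is None:
        return []
    # Pull the text of every <tag2> inside that slice.
    return re.findall(r"<tag2>(.*?)</tag2>", section.group(1), flags=re.DOTALL)

print(tag2_texts_after(html, "some_text_1"))  # ['text_1', 'text_2']
```

Note that this returns the inner text as strings rather than BeautifulSoup tag objects.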
Upvotes: 0
Reputation: 180401
Your description, "I need to get all tag2 instances that are after tag1 with some_text_1 and before the next tag2", basically equates to getting the first tag2 after any tag1 with the text some_text_.
So find the tag1 elements with that text and check whether the next sibling tag is a tag2; if it is, pull the tag2:
from bs4 import BeautifulSoup

html = """<tag1>some_text_1</tag1>
<tag2>text_1</tag2>
<tag3>....</tag3>
<tag2>text_2</tag2>
<tag1>some_text_2</tag1>
<tag2>text_3</tag2>"""

def get_tags_if_preceded_by(soup, tag1, tag2, text):
    # string= is the current spelling of the older text= argument (bs4 4.4.0+)
    for t1 in soup.find_all(tag1, string=text):
        # With no arguments this returns the next sibling that is a tag
        nxt_sib = t1.find_next_sibling()
        if nxt_sib and nxt_sib.name == tag2:
            yield nxt_sib

soup = BeautifulSoup(html, "lxml")
print(list(get_tags_if_preceded_by(soup, "tag1", "tag2", "some_text_1")))
If it does not have to be directly after, that actually makes it simpler: you just need to search for the next tag2 sibling:
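def get_tags_if_preceded_by(soup, tag1, tag2, text):
    # string= is the current spelling of the older text= argument (bs4 4.4.0+)
    for t1 in soup.find_all(tag1, string=text):
        # Search by the tag2 parameter (the original passed an undefined name, t2)
        nxt_sib = t1.find_next_sibling(tag2)
        if nxt_sib:
            yield nxt_sib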
If you really want to find tags between two tags specifically, you can use the logic in this answer.
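Since the question asks for every tag2 between the matching tag1 and the next tag1, one way to sketch that is to walk the following siblings and stop at the boundary. The helper name tags_between and the choice of html.parser are assumptions for illustration, and this only works if the tags are siblings, as they are in the sample.

```python
from bs4 import BeautifulSoup

html = """<tag1>some_text_1</tag1>
<tag2>text_1</tag2>
<tag3>....</tag3>
<tag2>text_2</tag2>
<tag1>some_text_2</tag1>
<tag2>text_3</tag2>"""

def tags_between(soup, boundary, wanted, text):
    # Hypothetical helper: locate the boundary tag whose string equals `text`.
    start = soup.find(lambda tag: tag.name == boundary and tag.string == text)
    if start is None:
        return []
    found = []
    # find_next_siblings() with no arguments yields only tags, in document order.
    for sib in start.find_next_siblings():
        if sib.name == boundary:
            break  # reached the next boundary tag, stop collecting
        if sib.name == wanted:
            found.append(sib)
    return found

soup = BeautifulSoup(html, "html.parser")
print([t.get_text() for t in tags_between(soup, "tag1", "tag2", "some_text_1")])
# ['text_1', 'text_2']
```

If no later boundary tag exists, the loop simply runs to the last sibling, so the final section of the document is handled as well.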
Upvotes: 1