Reputation: 53
<p>This is the first paragraph with some details</p>
<p><a href = "user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href = "user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user1</p></font>
!----There is n number of data like this-----!
This is the structure of my HTML. My aim is to extract the users and their contents, so in this case it should print all the content between two 'a' tags. This is just an example of my structure; in the real HTML, I have different types of tags between two 'a' tags. I need a way to iterate over all the tags that follow an 'a' tag until the next 'a' tag is found. I hope that's clear.
The code I tried is:
for i in soup.findAll('a'):
    while(i.nextSibling.name!='a'):
        print i.nextSibling
It gives me an infinite loop. If anyone has an idea how I can solve this issue, please share it with me.
Expected output is :
username is : user1
text is : This is opening contents for user1 This is the contents from user1 This is more content from user1
username is : user2
text is : This is opening contents for user2 This is the contents from user2 This is more content from user2
and so on......
Upvotes: 1
Views: 1796
Reputation: 5808
Try this:
from bs4 import BeautifulSoup
html="""
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user1</p></font>
"""
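soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
    print('name:', i.text)
    # Start after the link itself (otherwise its own text is printed as
    # contents), then continue with the siblings of its parent <p>.
    for s in [i.find_next_sibling(), i.parent.find_next_sibling()]:
        while s is not None:
            # Stop at the next block that contains a link (the next user).
            if s.find('a') is not None:
                break
            print('contents:', s.text)
            s = s.find_next_sibling()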
(Note: find_all is the recommended name for findAll; it may not be available in older versions of BeautifulSoup. The same goes for find_next_sibling.)
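As for why the loop in the question never ends: i.nextSibling is read from the same i on every pass, so the condition never changes. Below is a minimal sketch of one way around that, walking next_elements from each link and collecting text until the next 'a' tag, which should also cope with arbitrary tags in between. The function name collect_user_texts and the HTML constant are made up for illustration, the sample markup is shortened, and its last line says user2 to match the expected output in the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened sample markup (hypothetical constant, adapted from the question).
HTML = """
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user2</p></font>
"""


def collect_user_texts(html):
    """Return a list of (username, text) pairs, one per <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a"):
        parts = []
        # next_elements moves forward through the document in parse order,
        # so the cursor always advances and the loop terminates.
        for node in link.next_elements:
            if isinstance(node, Tag) and node.name == "a":
                break  # reached the next user
            if isinstance(node, NavigableString):
                # Skip the link's own text, which comes first in next_elements.
                if any(parent is link for parent in node.parents):
                    continue
                text = node.strip()
                if text:
                    parts.append(text)
        results.append((link.get_text(strip=True), " ".join(parts)))
    return results


if __name__ == "__main__":
    for name, text in collect_user_texts(HTML):
        print("username is :", name)
        print("text is :", text)
```

Because it collects text nodes rather than specific tags, this does not depend on the content being wrapped in 'font' or 'p'; whether that is what you want depends on your real markup.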
Upvotes: 0
Reputation: 36262
One option is to search for every <a> tag with find_all() and, for each link, use find_all_next() to search for the <font> tags that hold the contents for that user. The following script extracts each user name and its contents and saves both as a tuple inside a list:
from bs4 import BeautifulSoup

l = []
soup = BeautifulSoup(open('htmlfile'), 'html.parser')
for link in soup.find_all('a'):
    s = []
    # Look at every following <font> or <a>, in document order.
    for elem in link.find_all_next(['font', 'a']):
        # The next <a> marks the start of the next user.
        if elem.name == 'a':
            break
        s.append(elem.string)
    user_content = ' '.join(s)
    l.append((link.string, user_content))
It yields:
[('user1', 'This is the contents from user1 This is more content from user1'),
('user2', 'This is the contents from user2 This is more content from user2')]
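If you want the layout from the question rather than a list, the tuples can be formatted afterwards. This is only a sketch; format_users is a hypothetical helper name, and it assumes a list of (username, text) pairs like the one built above.

```python
def format_users(pairs):
    """Format (username, text) pairs in the layout asked for in the question."""
    lines = []
    for name, text in pairs:
        lines.append("username is : %s" % name)
        lines.append("text is : %s" % text)
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    print(format_users([("user1", "some text"), ("user2", "other text")]))
```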
Upvotes: 1