Reputation: 611
I have an HTML file and I want to grab the text from this block:
<strong class="fullname js-action-profile-name">User Name</strong>
<span>‏</span>
<span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>
I want it to display as:
User Name
@UserName
How would I do this using Beautiful Soup?
Upvotes: 3
Views: 239
Reputation: 817
from bs4 import BeautifulSoup
html = '''<strong class="fullname js-action-profile-name">User Name</strong>
<span>‏</span>
<span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, 'html.parser')
username = soup.find(attrs={'class':'username js-action-profile-name'}).text
fullname = soup.find(attrs={'class':'fullname js-action-profile-name'}).text
print(fullname)
print(username)
Outputs:
User Name
@UserName
Two notes:
Use bs4 if you're starting something new or are just learning BS.
You will probably be loading your HTML from an external file, so replace html with a file object.
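As a rough sketch of the second note, the same lookup can read from a file object instead of a string. The file name profile.html, the helper read_names, and the temporary directory are assumptions made only so the example is self-contained; matching on a single class with class_ is a swapped-in alternative to the attrs dictionary used above.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# The markup from the question, written to disk so there is a file to open.
MARKUP = (
    '<strong class="fullname js-action-profile-name">User Name</strong>\n'
    '<span>\u200f</span>\n'
    '<span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'
)


def read_names(path):
    """Hypothetical helper: parse the file at `path`, return (fullname, username)."""
    with open(path, encoding="utf-8") as fh:
        # BeautifulSoup accepts an open file object as well as a string.
        soup = BeautifulSoup(fh, "html.parser")
    # class_ matches any element that has this class among its classes.
    fullname = soup.find(class_="fullname").get_text()
    username = soup.find(class_="username").get_text()
    return fullname, username


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profile.html")  # assumed file name
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(MARKUP)
    fullname, username = read_names(path)

print(fullname)
print(username)
```

Passing the file object inside a with block closes the file once parsing is done; the parsed soup stays usable afterwards.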
Upvotes: 1
Reputation: 4709
This assumes index.html contains the markup from the question:
import BeautifulSoup

def displayUserInfo():
    soup = BeautifulSoup.BeautifulSoup(open("index.html"))
    fullname_ele = soup.find(attrs={"class": "fullname js-action-profile-name"})
    fullname = fullname_ele.contents[0]
    print fullname
    username_ele = soup.find(attrs={"class": "username js-action-profile-name"})
    username = ""
    for child in username_ele.findChildren():
        username += child.contents[0]
    print username

if __name__ == '__main__':
    displayUserInfo()
    # prints:
    # User Name
    # @UserName
Upvotes: 0
Reputation: 11996
Use the "text" attribute. Example:
>>> b = BeautifulSoup.BeautifulStoneSoup(open('/tmp/x.html'), convertEntities=BeautifulSoup.BeautifulStoneSoup.HTML_ENTITIES)
>>> print b.find(attrs={"id": "container"}).text
User Name@UserName
In x.html I have a div with an id of "container" that contains the HTML you provided. Note that I convert the &rlm; entity to \u200f with BeautifulStoneSoup. To insert a newline (which a browser would not introduce), just replace u'\u200f' with '\n'.
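The replacement step can be sketched with plain string handling. The value of text below is an assumption standing in for what .text returns for the container: the two names joined by the right-to-left mark.

```python
# Assumed stand-in for b.find(attrs={"id": "container"}).text
text = u"User Name\u200f@UserName"

# Swap the right-to-left mark for a newline to split the two names.
lines = text.replace(u"\u200f", u"\n")
print(lines)
```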
Upvotes: 1