Reputation: 2500
<div class="name">
<strong>
<a target="_blank" href="/page3.html">
SOME_Name_TEXT
</a>
</strong>
</div>
<div class="data">
<img src="/page1/page2/Images/pic.png" height="13" width="13">
SOME_Data_TEXT
</div>
I have an HTML page whose div elements use several different classes. I can extract the "name" and "data" divs using BeautifulSoup:
myName = soup.findAll("div", {"class" : "name"})
myData = soup.findAll("div", {"class" : "data"})
But when I run the script and print the elements of myName and myData, this is what I get:
Â Â SOME_Name_TEXT(as a link)
Â SOME_Data_TEXT
The problem is that I don't want the Â characters. They come from the &nbsp; entities in the markup: two in the first block and one in the second.
I just want the result as:
SOME_Name_TEXT(as a link)
SOME_Data_TEXT
In the first part, the link around "SOME_Name_TEXT" is required. In the data part the image is not needed; I want just the raw text, i.e. "SOME_Data_TEXT". I tried doing it with str.split(). How can I get exactly these results?
Upvotes: 0
Views: 280
Reputation: 2500
Finally solved it with the help of other questions:
For the first part, i.e.
<div class="name">
<strong>
<a target="_blank" href="/page3.html">
SOME_Name_TEXT
</a>
</strong>
</div>
With this block in x, I used print x.findNext('strong')
And for the second part, i.e.
<div class="data">
<img src="/page1/page2/Images/pic.png" height="13" width="13">
SOME_Data_TEXT
</div>
I did this:
tmp = x.findNext('img')
print tmp.get_text().strip()
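For comparison, here is a minimal sketch of the same extraction written for Python 3 with BeautifulSoup 4 and the built-in "html.parser". It is not the code used above. The position of the &nbsp; entities in the sample markup is an assumption (the question only says there are two in the first block and one in the second), and the helper name clean is hypothetical.

```python
from bs4 import BeautifulSoup

# Sample markup; where exactly the &nbsp; entities sit is an assumption.
html = """
<div class="name"><strong><a target="_blank" href="/page3.html">&nbsp;&nbsp;SOME_Name_TEXT</a></strong></div>
<div class="data"><img src="/page1/page2/Images/pic.png" height="13" width="13">&nbsp;SOME_Data_TEXT</div>
"""

soup = BeautifulSoup(html, "html.parser")


def clean(text):
    """Drop non-breaking spaces (U+00A0) and surrounding whitespace."""
    return text.replace("\xa0", "").strip()


# First part: keep the link, but read its text without the padding.
link = soup.find("div", {"class": "name"}).find("a")
name_text = clean(link.get_text())
href = link["href"]

# Second part: the text of the div, ignoring the image.
data_text = clean(soup.find("div", {"class": "data"}).get_text())

print(name_text, href)
print(data_text)
```

Calling get_text() on the whole div skips the img tag because it contains no text, so no separate lookup of the image is needed in this sketch.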
Upvotes: 0
Reputation: 602
You'll have to do a unicode replace to remove the &nbsp;, because BS converts HTML entities to unicode characters.
Edit:
soup.prettify(formatter=lambda x: x.replace(u'\xa0', ''))
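To illustrate why the replacement targets u'\xa0', here is a small standard-library sketch (Python 3, no BeautifulSoup involved). It shows that &nbsp; decodes to U+00A0 and that its UTF-8 bytes, if read as Latin-1, show up as Â, which is the likely source of the stray character in the question. The variable names are illustrative only.

```python
import html

# The &nbsp; entity decodes to the single character U+00A0.
nbsp = html.unescape("&nbsp;")

# Encoded as UTF-8 it becomes two bytes: 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# Read back with the wrong codec (Latin-1), 0xC2 is shown as "Â".
misread = utf8_bytes.decode("latin-1")

# Removing U+00A0 from the decoded text gets rid of the padding.
cleaned = html.unescape("&nbsp;SOME_Data_TEXT").replace("\xa0", "")

print(repr(nbsp), utf8_bytes, repr(misread), cleaned)
```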
Other options: For myData, to just get the text, do this:
myData = soup.findAll("div", {"class" : "data"})[0].find('img').contents[0].strip()
and for myName:
myName = repr(soup.findAll("div", {"class" : "name"})[0].find('a'))
myName = re.sub('&nbsp;', '', myName)  # needs "import re" at the top of the script
Does that work for you?
Upvotes: 0
Reputation: 2619
Since you do not want the &nbsp;, you can do something like this:
myName = soup.findAll("div", {"class" : "name"})
myData = soup.findAll("div", {"class" : "data"})
# Print the match only if no text node consists of just "&nbsp;"
if myName and not soup.findAll(text="&nbsp;"):
    print(myName)
Or, as a second approach (here text stands for your myName):
text = "&nbsp;hey how are you doing"
text = text.replace("&nbsp;", "")  # remove the entity from the string
print(text)
Upvotes: 1