Reputation: 1079
I'd like to get items from a website with BeautifulSoup.
<div class="post item">
This is the target tag. Its class attribute holds two class names separated by white space.
First, I wrote:
roots = soup.find_all("div", "post item")
But it didn't work. Then I wrote:
html.find_all("div", {'class':['post', 'item']})
I could get items with this, but I am not sure whether it is correct. Is this code correct?
//// Additional ////
I am sorry,
html.find_all("div", {'class':['post', 'item']})
didn't work properly. It also extracts class="item".
And I had to write
soup.find_all("div", class_="post item")
with class_= rather than class=. However, this doesn't work for me either... (>_<)
Target URL:
https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb
My code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen

from bs4 import BeautifulSoup


def main():
    target = "https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb"
    html = urlopen(target)
    soup = BeautifulSoup(html, "html.parser")
    roots = soup.find_all("div", class_="post item")
    print(roots)
    for root in roots:
        print("##################")


if __name__ == '__main__':
    main()
Upvotes: 2
Views: 1059
Reputation: 180481
You could use a CSS selector:
soup.select("div.post.item")
Or use class_:
soup.find_all("div", class_="post item")
The docs suggest that if you want to search for tags that match two or more CSS classes, you should use a CSS selector, as in the first example. They give examples of both uses:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
If you want to search for tags that match two or more CSS classes, you should use a CSS selector:
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
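To make the difference between the two calls concrete, here is a minimal sketch on a made-up snippet (the html_doc string and the texts helper are my own illustration, not taken from the target page). It assumes bs4 is installed and uses the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html_doc = """
<div class="post item">A</div>
<div class="item post">B</div>
<div class="post item featured">C</div>
<div class="item">D</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def texts(tags):
    """Return the text of each tag so the results are easy to compare."""
    return [tag.get_text() for tag in tags]


# class_="post item" matches the exact string value of the class attribute,
# so a different order or an extra class does not match.
exact_string = texts(soup.find_all("div", class_="post item"))

# The CSS selector matches any div carrying both classes, in any order,
# with or without additional classes.
both_classes = texts(soup.select("div.post.item"))

print(exact_string)  # ['A']
print(both_classes)  # ['A', 'B', 'C']
```

So which one to pick depends on whether you want "exactly this attribute string" or "has both classes".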
Why your code fails, and why any of the above solutions would fail too, has more to do with the fact that the class does not exist in the source. If it were there, they would all work:
In [6]: r = requests.get("https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb")
In [7]: cont = r.text  # str rather than bytes, so the membership test below also works on Python 3
In [8]: "post item" in cont
Out[8]: False
If you look at the browser source and search for it, you won't find it either. It is generated dynamically and can only be seen if you open a developer console or Firebug. Those divs also only contain some styling and React ids, so I am not sure what you expect to pull from them even if you did get them.
If you want to get the HTML that you see in the browser, you will need something like Selenium.
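A rough sketch of the Selenium route, under several assumptions: Selenium 4 with Chrome and a matching driver are installed, the "--headless=new" flag is supported by your Chrome version (older versions use "--headless"), and the div.post.item elements actually show up once the page's JavaScript has run, which I have not verified. The function names fetch_rendered_divs and select_rendered are made up for this example.

```python
from bs4 import BeautifulSoup


def select_rendered(page_source, css="div.post.item"):
    """Parse already-rendered HTML and return the tags matching `css`."""
    return BeautifulSoup(page_source, "html.parser").select(css)


def fetch_rendered_divs(url, css="div.post.item", timeout=15):
    """Load `url` in headless Chrome, wait for `css` to appear, return the matches.

    Assumption: Chrome and its driver are available on this machine.
    """
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Raises TimeoutException if nothing matching `css` is rendered in time.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css))
        )
        return select_rendered(driver.page_source, css)
    finally:
        # Always close the browser, even when the wait times out.
        driver.quit()
```

Calling fetch_rendered_divs(target) with the URL from the question would then give you the rendered divs to loop over, if they exist after rendering.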
Upvotes: 3
Reputation: 474061
First of all, note that class is a very special multi-valued attribute, and it is a common source of confusion in BeautifulSoup.
html.find_all("div", {'class':['post', 'item']})
This would find all div elements that have either the post class or the item class (or both, of course). This may produce extra results you don't want to see, assuming you are after div elements with strictly class="post item". If this is the case, you can use a CSS selector:
html.select('div[class="post item"]')
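Here is a small sketch of the over-matching and of the attribute selector, run on an invented snippet (the sample string is hypothetical, and I call the parsed document soup where the question calls it html). It assumes bs4 with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
sample = """
<div class="post item">wanted</div>
<div class="item">extra: item only</div>
<div class="post">extra: post only</div>
<div class="post item wide">extra: third class</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# A list value means "any of these classes", so every div above matches.
any_class = [d.get_text() for d in soup.find_all("div", {"class": ["post", "item"]})]

# The attribute selector compares the whole class value as one string,
# so only class="post item" matches.
exact_attr = [d.get_text() for d in soup.select('div[class="post item"]')]

print(len(any_class))  # 4
print(exact_attr)      # ['wanted']
```

Keep in mind that the attribute selector is strict about order and extra classes, so class="item post" would not match it either.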
There is also some more information in a similar thread:
Upvotes: 2