Elsa Strahmbrand

Reputation: 11

How to get div with multiple classes BS4

What is the most efficient way to get divs with BeautifulSoup4 if they have multiple classes?

I have an html structure like this:

<div class='class1 class2 class3 class4'>
  <div class='class5 class6 class7'>
     <div class='comment class14 class15'>
       <div class='date class20 showdate'> 1/10/2017</div>
       <p>comment2</p>
     </div>
     <div class='comment class25 class9'>
       <div class='date class20 showdate'> 7/10/2017</div>
       <p>comment1</p>
     </div>
  </div>
</div>

I want to get the divs with the comment class. Usually nested classes are no problem, but I don't know why the command:

html = BeautifulSoup(content, "html.parser")
comments = html.find_all("div", {"class":"comment"})

doesn't work: it returns an empty list. My guess is that because each div has several classes, it looks for a div whose only class is comment, and no such div exists. How can I find all the comments?

Upvotes: 1

Views: 1351

Answers (1)

user4066647

Reputation:

Apparently, the URL that fetches the comments section is different from the original URL that retrieves the main content.

This is the original URL you gave:

http://community.sparknotes.com/2017/10/06/find-out-your-colleges-secret-mantra-we-hack-college-life-at-the-100-of-the-best

Behind the scenes, if you record the network log in the Network tab of Chrome's developer tools, you'll see a list of all the requests the browser sends. Most of them fetch images and scripts; a few go to other sites such as Facebook or Google (for analytics, etc.). The browser also sends one more request to this particular site (sparknotes), and that request returns the comments section. This is the URL:

http://community.sparknotes.com/commentlist?post_id=1375724&page=1&comment_type=&_=1507467541548

The value for post_id can be found in the page returned by the first URL. It is contained in an input tag of type hidden.

<input type="hidden" id="postid" name="postid" value="1375724">

You can extract this value from the first page with a simple soup.find('input', {'id': 'postid'})['value']. Since this id uniquely identifies the post, you need not worry about it changing between requests.
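In isolation, that lookup can be sketched like this, using the input tag shown above:

```python
from bs4 import BeautifulSoup

# The hidden input tag from the main page, as shown above.
page = '<input type="hidden" id="postid" name="postid" value="1375724">'

soup = BeautifulSoup(page, 'html.parser')
post_id = soup.find('input', {'id': 'postid'})['value']
print(post_id)  # 1375724
```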

I couldn't find the value '1507467541548' passed to the '_' parameter (the last parameter of the URL) anywhere in the main page, or in the cookies set by the response headers of any of the pages.

However, I went out on a limb and tried fetching the URL without the '_' parameter, and it worked.
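My guess, for what it's worth: '_' looks like a jQuery-style cache-buster, i.e. the request time in milliseconds since the epoch, which jQuery appends to AJAX GETs when caching is disabled. That would also explain why the server happily accepts the URL without it. A minimal sketch, assuming that interpretation:

```python
import time

# Assumption: the '_' parameter is a jQuery-style cache-buster, i.e. the
# current time in milliseconds since the epoch. It only exists to make
# each URL unique so intermediaries don't serve a cached response.
cache_buster = int(time.time() * 1000)
url = ('http://community.sparknotes.com/commentlist'
       '?post_id=1375724&page=1&comment_type=&_={}'.format(cache_buster))
print(url)
```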

So, here's the entire script that worked for me:

from bs4 import BeautifulSoup
import requests

req_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'community.sparknotes.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    url = 'http://community.sparknotes.com/2017/10/06/find-out-your-colleges-secret-mantra-we-hack-college-life-at-the-100-of-the-best'
    r = s.get(url, headers=req_headers)

    soup = BeautifulSoup(r.content, 'lxml')
    post_id = soup.find('input', {'id': 'postid'})['value']

    # url = 'http://community.sparknotes.com/commentlist?post_id=1375724&page=1&comment_type=&_=1507467541548' # the original URL found in network tab
    url = 'http://community.sparknotes.com/commentlist?post_id={}&page=1&comment_type='.format(post_id) # modified by removing the '_' parameter

    r = s.get(url)

    soup = BeautifulSoup(r.content, 'lxml')
    comments = soup.find_all('div', {'class': 'commentCite'})

    for comment in comments:
        c_name = comment.div.a.text.strip()
        c_body = comment.find('div', {'class': 'commentBodyInner'}).text.strip()
        print(c_name, c_body)

As you can see, I didn't pass headers to the second requests.get, so I'm not sure whether they're required at all; you could experiment with omitting them in the first request as well. Do use requests with a Session, though: I haven't tried urllib, and cookies might play a vital role here.
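Incidentally, the multiple classes in your original snippet weren't the problem: BeautifulSoup treats class as a multi-valued attribute, so find_all("div", {"class": "comment"}) matches any div whose class list merely contains comment. On the static HTML from your question it finds both comments just fine; an empty result means the comments weren't in the first response at all:

```python
from bs4 import BeautifulSoup

# The static HTML structure from the question.
content = """
<div class='class1 class2 class3 class4'>
  <div class='class5 class6 class7'>
     <div class='comment class14 class15'>
       <div class='date class20 showdate'> 1/10/2017</div>
       <p>comment2</p>
     </div>
     <div class='comment class25 class9'>
       <div class='date class20 showdate'> 7/10/2017</div>
       <p>comment1</p>
     </div>
  </div>
</div>
"""

html = BeautifulSoup(content, "html.parser")
# 'class' is multi-valued in BS4: this matches any div that has
# 'comment' among its classes, not only divs whose sole class is it.
comments = html.find_all("div", {"class": "comment"})
print(len(comments))  # 2
```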

Upvotes: 1
