Reputation: 2534
I'm using bs4 to scrape Product Hunt.
Taking this post as an example, when I scrape it using the below code, the "discussion" section is entirely absent.
res = requests.get('https://producthunt.com/posts/weights-biases')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
pprint.pprint(soup.prettify())
I suspect this has something to do with lazy loading (when you open the page, the "discussion" section takes an extra second or two to appear).
How do I scrape lazy loaded components? Or is this something else entirely?
Upvotes: 0
Views: 102
Reputation: 1331
It seems that some elements of the page are dynamically loaded through Javascript queries.
The requests
library allows you to send queries manually and then you parse the content of the updated page with bs4.
However, from my experience with dynamic webpages, this approach can be really annoying if you have a lot of queries to send.
Generally in these cases it's preferable to use a library that integrates real-time browser simulation. This way, the simulator itself will handle client-server communication and update the page; you will just have to wait for the elements to be loaded and then analyze them safely.
So I suggest you take a look at selenium
or even selenium-requests
if you prefer to keep the requests
'philosophy'.
Upvotes: 1
Reputation: 22440
This is how you can get the comments under discussion. You can always rectify the script to get the concerning replies each thread has got.
import json
import requests
from pprint import pprint
url = 'https://www.producthunt.com/frontend/graphql'
payload = {"operationName":"PostPageCommentsSection","variables":{"commentsListSubjectThreadsCursor":"","commentsThreadRepliesCursor":"","slug":"weights-biases","includeThreadForCommentId":None,"commentsListSubjectThreadsLimit":10},"query":"query PostPageCommentsSection($slug:String$commentsListSubjectThreadsCursor:String=\"\"$commentsListSubjectThreadsLimit:Int!$commentsThreadRepliesCursor:String=\"\"$commentsListSubjectFilter:ThreadFilter$includeThreadForCommentId:ID$excludeThreadForCommentId:ID){post(slug:$slug){id canManage ...PostPageComments __typename}}fragment PostPageComments on Post{_id id slug name ...on Commentable{_id id canComment __typename}...CommentsSubject ...PostReviewable ...UserSubscribed meta{canonicalUrl __typename}__typename}fragment PostReviewable on Post{id slug name canManage featuredAt createdAt disabledWhenScheduled ...on Reviewable{_id id reviewsCount reviewsRating isHunter isMaker viewerReview{_id id sentiment comment{id body __typename}__typename}...on Commentable{canComment commentsCount __typename}__typename}meta{canonicalUrl __typename}__typename}fragment CommentsSubject on Commentable{_id id ...CommentsListSubject __typename}fragment CommentsListSubject on Commentable{_id id threads(first:$commentsListSubjectThreadsLimit after:$commentsListSubjectThreadsCursor filter:$commentsListSubjectFilter include_comment_id:$includeThreadForCommentId exclude_comment_id:$excludeThreadForCommentId){edges{node{_id id ...CommentThread __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}__typename}fragment CommentThread on Comment{_id id isSticky replies(first:5 after:$commentsThreadRepliesCursor allForCommentId:$includeThreadForCommentId){edges{node{_id id ...Comment __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}...Comment __typename}fragment Comment on Comment{_id id badges body bodyHtml canEdit canReply canDestroy createdAt isHidden path repliesCount subject{_id id ...on Commentable{_id id __typename}__typename}user{_id id headline name firstName username headline ...UserSpotlight __typename}poll{...PollFragment __typename}review{id sentiment __typename}...CommentVote ...FacebookShareButtonFragment __typename}fragment CommentVote on Comment{_id id ...on Votable{_id id hasVoted votesCount __typename}__typename}fragment FacebookShareButtonFragment on Shareable{id url __typename}fragment UserSpotlight on User{_id id headline name username ...UserImage __typename}fragment UserImage on User{_id id name username avatar headline isViewer ...KarmaBadge __typename}fragment KarmaBadge on User{karmaBadge{kind score __typename}__typename}fragment PollFragment on Poll{id answersCount hasAnswered options{id text imageUuid answersCount answersPercent hasAnswered __typename}__typename}fragment UserSubscribed on Subscribable{_id id isSubscribed __typename}"}
r = requests.post(url,json=payload)
for item in r.json()['data']['post']['threads']['edges']:
pprint(item['node']['body'])
Output at this moment:
('Looks like such a powerful tool for extracting performance insights! '
'Absolutely love the documentation feature, awesome work!')
('This is awesome and so Any discounts or special pricing for '
'researchers/students/non-professionals?')
'Amazing. I think this is very helpful tools for us. Keep it up & go ahead.'
('<p>This simple system of record automatically saves logs from every '
'experiment, making it easy to look over the history of your progress and '
'compare new models with existing baselines.</p>\n'
'Pros: <p>Easy, fast, and lightweight experiment tracking</p>\n'
'Cons: <p>Only available for Python projects</p>')
('Very cool! I hacked together something similar but much more basic for '
"personal use and always wondered why TensorBoard didn't solve this problem. "
'I just wish this was open source! :) P.S. awesome use of the parallel '
'co-ordinates d3.js chart - great idea to apply it to experiment '
'configurations!')
Upvotes: 0