Reputation: 91
This code gets the page. My problem is that I need to scrape the content of users' comments, not the number of comments. The link to each thread's comments is nested inside the comment-count section, but I am not sure how to access that link, follow it, and scrape the user comments.
import time

import requests
from bs4 import BeautifulSoup

request_list = []
id_list = [0]
for i in range(0, 200, 25):
    response = requests.get(
        "https://www.reddit.com/r/CryptoCurrency/?count=" + str(i) + "&after=" + str(id_list[-1]),
        headers={'User-agent': 'No Bot'})
    soup = BeautifulSoup(response.content, 'lxml')
    request_list.append(soup)
    # Remember the fullname of the last thread so the next request pages on from it.
    id_list.append(soup.find_all('div', attrs={'data-type': 'link'})[-1]['data-fullname'])
    print(i, id_list)
    if i % 100 == 0:
        time.sleep(1)
In the code below I tried writing a function that is supposed to access the nested comments, but I have no clue how to get at them.
def extract_comment_contents(request_list):
    comment_contents_list = []
    for soup in request_list:
        # request_list holds parsed soups, so the status check has to happen
        # when the pages are fetched, not here.
        # Note: attrs matches the attribute value exactly, so this only finds
        # anchors whose data-inbound-url is literally this string.
        for each in soup.find_all('a', attrs={'data-inbound-url': '/r/CryptoCurrency/comments/'}):
            comment_contents_list.append(each.text)
    return comment_contents_list
fetch_comment_contents_list = extract_comment_contents(request_list)
print(fetch_comment_contents_list)
Upvotes: 2
Views: 1059
Reputation: 7238
For each thread, you need to send another request to get its comments page. The url for the comments page can be found with soup.find_all('a', class_='bylink comments may-blank'), which returns all the a tags that hold the url of a comments page. I'll show you one example of getting to the comments page.
import requests
from bs4 import BeautifulSoup

# Same User-agent header as in your first snippet; Reddit tends to block the default one.
r = requests.get('https://www.reddit.com/r/CryptoCurrency/?count=0&after=0',
                 headers={'User-agent': 'No Bot'})
soup = BeautifulSoup(r.text, 'lxml')
for comments_tag in soup.find_all('a', class_='bylink comments may-blank', href=True):
    url = comments_tag['href']  # the thread's permalink
    r2 = requests.get(url, headers={'User-agent': 'No Bot'})
    comments_soup = BeautifulSoup(r2.text, 'lxml')
    # Your job is to parse this comments_soup object and get all the comments.
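If it helps, here is a minimal sketch of that parsing step. It assumes the old-Reddit markup, where each comment body is rendered as a div with class usertext-body wrapping a div with class md; those class names are an assumption on my part, so inspect the page and adjust before relying on them.

def extract_comments(comments_soup):
    """Collect the plain text of each comment on one comments page."""
    comments = []
    # Assumption: comment bodies live in div.usertext-body > div.md.
    # (The submission's own text, if any, also sits in a usertext-body div.)
    for body in comments_soup.find_all('div', class_='usertext-body'):
        md = body.find('div', class_='md')
        if md is not None:
            comments.append(md.get_text(' ', strip=True))
    return comments

You would call it on each comments_soup inside the loop above, e.g. comment_contents_list.extend(extract_comments(comments_soup)).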
Upvotes: 2