Reputation: 154
I'm trying to print only two fields using two functions. Both functions take the same URL but produce different results. The first function, get_names(), prints the names of different users. The second function, get_badges(), produces the number of badges belonging to those users. As the badge count is not present for every user, I used zip_longest() so that if a user doesn't have any badges, None is printed instead. However, the problem is that get_badges() gives me wrong results when it encounters a user without any badges.
I've tried with:
import requests
from bs4 import BeautifulSoup
from itertools import zip_longest
url = 'https://stackoverflow.com/questions/tagged/web-scraping'
def get_names(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".user-details > a"):
        yield item.text
def get_badges(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".badgecount"):
        yield item.text
if __name__ == '__main__':
    for elem in zip_longest(get_names(url),get_badges(url)):
        print(elem)
How can I let the two functions produce accurate results?
Upvotes: 1
Views: 87
Reputation: 6663
It's impossible to correlate those two lists!
You have no way of finding the correspondence between a user's name and their badge counts. Look what happens if you do:
print(list(get_names(url)))
print(list(get_badges(url)))
You'll get:
['Arkadi w', 'MITHU', 'Mohamed Suhail Irfan Khazi', 'Kevin Walsh', 'lowpeasant', 'vivekh99', 'Nico Gandolfo', ... ]
['7', '4', '18', '2', '2', '2', '1', ...]
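But if you zip those lists, the badge counts are paired purely by position, so they get attributed to the wrong users. For example, 'lowpeasant', who has no badges, ends up with a '2' that belongs to another user!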
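To make the mismatch concrete, here is a small sketch that uses only the visible (truncated) prefix of the two lists printed above, with no network access. It only illustrates positional pairing; it does not reproduce the live page.

```python
from itertools import zip_longest

# Only the entries visible in the printed output above; the real lists are longer.
names = ['Arkadi w', 'MITHU', 'Mohamed Suhail Irfan Khazi', 'Kevin Walsh',
         'lowpeasant', 'vivekh99', 'Nico Gandolfo']
badges = ['7', '4', '18', '2', '2', '2', '1']

# zip_longest pairs items by position; it only pads with None once the
# shorter iterable runs out, not where a user happens to have no badge.
pairs = list(zip_longest(names, badges))
for pair in pairs:
    print(pair)
```

Both lists happen to have the same length here, so no None is ever produced, and 'lowpeasant' is silently paired with a count that is not theirs.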
The only way I can imagine is to change your get_badges function to return a tuple of the form (name, badges), or a dictionary. Something like this:
def get_badges(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".user-details"):
        yield (item.find('a').text, [s.text for s in item.find_all('span', { "class" : "badgecount" })])
Upvotes: 1
Reputation: 588
I would approach this differently.
First, you request the same URL twice, so the fetching logic should be removed from both functions and placed elsewhere. Second, instead of trying to zip both results together, I would iterate over the user details and parse each one in separate functions, removing zip altogether, as it is what causes your inconsistencies.
Basically, in pseudocode:
def _retrieve_user(question):
    return "userthing"
def _retrieve_badgecount(question):
    # Handle a missing badge count here
    return 0  # placeholder: the parsed count, or 0 when there is none
def parse_question(question):
    # get user info
    # get badge count
    return (_retrieve_user(question=question),
            _retrieve_badgecount(question=question))
def main(url):
    # get url
    # make soup
    # iterate over questions using BeautifulSoup siblings
    return [parse_question(q) for q in ("question_1", "question_2")]
This should have the single responsibility concern covered and has no repetition.
Good luck!
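As a rough illustration of that structure, here is a sketch that parses a static HTML string instead of fetching the page. The markup in SAMPLE_HTML is hypothetical and modeled only on the class names used in this thread, the helper parse_block is an illustrative name, and html.parser is used in place of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="user-details"><a>MITHU</a>
  <span class="badge2"></span><span class="badgecount">4</span>
  <span class="badge3"></span><span class="badgecount">18</span></div>
<div class="user-details"><a>lowpeasant</a></div>
<div class="user-details"><a>vivekh99</a>
  <span class="badge3"></span><span class="badgecount">2</span></div>
"""

def _retrieve_user(block):
    # The user name is the text of the link inside the block, if any.
    link = block.select_one("a")
    return link.get_text(strip=True) if link else None

def _retrieve_badgecount(block):
    # Sum every badge count in this block; an empty block sums to 0.
    return sum(int(s.get_text(strip=True)) for s in block.select(".badgecount"))

def parse_block(block):
    return _retrieve_user(block), _retrieve_badgecount(block)

def main(html):
    # Parse once, then read name and badges from the same block.
    soup = BeautifulSoup(html, "html.parser")
    return [parse_block(b) for b in soup.select(".user-details")]

results = main(SAMPLE_HTML)
print(results)
```

Because the name and the badge count are read from the same block, a user without badges simply gets 0 and cannot shift anyone else's numbers.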
Upvotes: 0
Reputation: 195478
As stated above, you need something that "connects" the results from get_names() and get_badges(). In your code, there's nothing like it - so the results will be mismatched in zip.
In this code I use the CSS selector .user-details as the common element between the two functions. In your code you can have a common element in the form of a user name, a user id, etc., and return a dictionary/tuple from each function:
import requests
from bs4 import BeautifulSoup
url = 'https://stackoverflow.com/questions/tagged/web-scraping'
def get_names(soup):
    for item in soup.select(".user-details > a"):
        yield item.text
def get_badges(soup):
    for item in soup.select(".user-details"):
        gold = item.select_one('.badge1 + .badgecount')
        silver = item.select_one('.badge2 + .badgecount')
        bronze = item.select_one('.badge3 + .badgecount')
        yield [int(gold.text) if gold else 0,
               int(silver.text) if silver else 0,
               int(bronze.text) if bronze else 0]
if __name__ == '__main__':
    res = requests.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    print('{: <30}{: >5}{: >5}{: >5}'.format('Name', 'G', 'S', 'B'))
    print('-' * 45)
    for name, badges in zip(get_names(soup), get_badges(soup)):
        print('{: <30}{}'.format(name, ''.join('{: >5}'.format(b) for b in badges)))
Prints:
Name                              G    S    B
---------------------------------------------
Arkadi w                          0    0    7
MITHU                             0    4   18
Mohamed Suhail Irfan Khazi        0    0    2
Kevin Walsh                       0    0    2
lowpeasant                        0    0    0
vivekh99                          0    0    2
Nico Gandolfo                     0    0    1
Sam Edeus                         0    0    2
Tab Key                           0    0    7
Ion Aag                           0    0    5
... and so on.
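The other option mentioned above, joining on a shared key such as a user id, could look roughly like the sketch below. The ids, the dictionaries and the helper join_by_user are all hypothetical and stand in for whatever each scraping function would return.

```python
# Hypothetical data: the user ids and this shape are illustrative only.
names_by_id = {101: "MITHU", 102: "lowpeasant", 103: "vivekh99"}
badges_by_id = {101: [0, 4, 18], 103: [0, 0, 2]}  # no entry for user 102

def join_by_user(names, badges):
    """Pair each name with its [gold, silver, bronze] list, defaulting to zeros."""
    return [(name, badges.get(uid, [0, 0, 0])) for uid, name in names.items()]

rows = join_by_user(names_by_id, badges_by_id)
for name, badges in rows:
    print('{: <30}{}'.format(name, ''.join('{: >5}'.format(b) for b in badges)))
```

Here a missing key is handled explicitly with a default, so the pairing no longer depends on the order or length of either result.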
Upvotes: 2