Reputation: 930
I'm trying to get a Twitter profile name from the profile URL using BeautifulSoup in Python, but whatever HTML tags I use, I'm not able to get the name. What HTML tags can I use to get the profile name from a Twitter user page?
import requests
from bs4 import BeautifulSoup

url = 'https://twitter.com/twitterID'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Find the display name
name_element = soup.find('span', {'class': 'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
if name_element is not None:
    display_name = name_element.text
else:
    display_name = "error"
Upvotes: 0
Views: 722
Reputation: 4710
html = requests.get(url).text

Twitter profile pages cannot be scraped simply through `requests` like this, since the contents of the profile pages are loaded with JavaScript [via the API], as you might notice if you previewed the source HTML in your browser's network logs or checked the fetched HTML.
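To see the problem for yourself, here is a minimal sketch; the HTML shell below is a made-up stand-in for what `requests` actually receives from a JavaScript-rendered page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what requests receives from a JS-rendered page:
# an HTML shell with scripts, but no profile name anywhere in the markup
fetched_html = """
<html><body>
  <noscript>JavaScript is not available.</noscript>
  <script src="https://abs.twimg.com/responsive-web/bundle.js"></script>
</body></html>
"""
soup = BeautifulSoup(fetched_html, 'html.parser')

# The span the question searches for simply is not in the static HTML
print(soup.find('span', {'class': 'css-901oao'}))  # None
```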
name_element = soup.find('span', {'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
display_name = name_element.text

Even after fetching the right HTML, calling `.find` like that will result in `display_name` containing 'To view keyboard shortcuts, press question mark' or 'Don’t miss what’s happening', because there are 67 `span` tags with that class. Calling `.find_all(....)[6]` might work, but it's definitely not a reliable approach. You should instead consider using `.select` with CSS selectors to target the name.
name_element = soup.select_one('div[data-testid="UserName"] span>span')
The `.find` equivalent would be

# name_element = soup.find('div', {'data-testid': 'UserName'}).span.span ## too many weak points
name_element = soup.find(lambda t: t.name == t.parent.name == 'span' and t.find_parent('div', {'data-testid': 'UserName'}))

but I find `.select` much more convenient.
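Both approaches can be checked against a simplified stand-in for the rendered profile markup (the structure below is an assumption based on the selectors above; the real page nests the name much more deeply, but the selector logic is the same):

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the rendered profile markup
html = '''
<div data-testid="UserName">
  <span><span>Jane Doe</span></span>
  <div><span>@janedoe</span></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: a span directly inside a span, under the UserName div
by_select = soup.select_one('div[data-testid="UserName"] span>span')
# .find equivalent with a lambda filter
by_find = soup.find(lambda t: t.name == t.parent.name == 'span'
                    and t.find_parent('div', {'data-testid': 'UserName'}))
print(by_select.text, '|', by_find.text)  # Jane Doe | Jane Doe
```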
Below, I'm using two functions I often use for scraping: `linkToSoup_selenium` (which takes a URL and returns a `BeautifulSoup` object after using selenium and bs4 to load and parse the HTML), and `selectForList` (which extracts details from bs4 Tags based on selectors, like in the `selectors` dictionary below).
Setup:
# imports ## PASTE FROM https://pastebin.com/kEC9gPC8
# def linkToSoup_selenium... ## PASTE FROM https://pastebin.com/kEC9gPC8
# def selectForList... ## PASTE FROM https://pastebin.com/ZnZ7xM6u
## JUST FOR REDUCING WHITESPACE - not important for extracting information ##
def miniStr(o): return ' '.join(w for w in str(o).split() if w)
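Since those helpers live in the pastebin links, here is a hypothetical minimal version of `selectForList`, just to illustrate the interface the code below assumes: string values are CSS selectors whose text is extracted, while (selector, attribute) pairs extract an attribute instead.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal stand-in for selectForList (the real helper is in the
# pastebin link above) - NOT the original implementation
def selectForListMini(soup, selRef):
    extracted = {}
    for key, sel in selRef.items():
        sel, attr = sel if isinstance(sel, tuple) else (sel, None)
        tag = soup.select_one(sel)
        if tag is None:
            extracted[key] = None  # selector matched nothing
        else:
            extracted[key] = tag.get(attr) if attr else tag.get_text(strip=True)
    return extracted

demo = BeautifulSoup(
    '<meta property="og:url" content="https://twitter.com/jane">'
    '<div data-testid="UserName"><span><span>Jane</span></span></div>',
    'html.parser')
result = selectForListMini(demo, {
    'og_url': ('meta[property="og:url"][content]', 'content'),
    'name_span': 'div[data-testid="UserName"] span>span',
})
print(result)  # {'og_url': 'https://twitter.com/jane', 'name_span': 'Jane'}
```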
profileUrls = ['https://twitter.com/twitterID', 'https://twitter.com/jokowi', 'https://twitter.com/sep_colin']
# ptSel = 'article[data-testid="tweet"]:has(div[data-testid="socialContext"])'
# ptuaSel = 'div[data-testid="User-Names"]>div>div>div>a'
selectors = {
    'og_url': ('meta[property="og\:url"][content]', 'content'),
    'name_span': 'div[data-testid="UserName"] span>span',
    'name_div': 'div[data-testid="UserName"]',
    # 'handle': 'div[data-testid="UserName"]>div>div>div+div',
    'description': 'div[data-testid="UserDescription"]',
    # 'location': 'span[data-testid="UserLocation"]>span',
    # 'url_href': ('a[data-testid="UserUrl"][href]', 'href'),
    # 'url_text': 'a[data-testid="UserUrl"]>span',
    # 'birthday': 'span[data-testid="UserBirthdate"]',
    # 'joined': 'span[data-testid="UserJoinDate"]>span',
    # 'following': 'div[data-testid="UserName"]~div>div>a[href$="\/following"]',
    # 'followers': 'div[data-testid="UserName"]~div>div>a[href$="\/followers"]',
    # 'pinnedTweet_uname': f'{ptSel} div[data-testid="User-Names"] span>span',
    # 'pinnedTweet_handl': f'{ptSel} {ptuaSel}:not([aria-label])',
    # 'pinnedTweet_pDate': (f'{ptSel} {ptuaSel}[aria-label]', 'aria-label'),
    # 'pinnedTweet_text': f'{ptSel} div[data-testid="tweetText"]',
}
def scrapeTwitterProfile(profileUrl, selRef=selectors):
    soup = linkToSoup_selenium(profileUrl, ecx=[
        'div[data-testid="UserDescription"]'  # wait for user description to load
        # 'article[data-testid="tweet"]'  # wait for tweets to load
    ], tmout=3, by_method='css', returnErr=True)
    if not isinstance(soup, str): return selectForList(soup, selRef)
    return {'Error': f'failed to scrape {profileUrl} - {soup}'}
Setting `returnErr=True` returns the error message (a string instead of the `BeautifulSoup` object) if anything goes wrong. `ecx` should be set based on which part/s you want to load (it's a list, so it can have multiple selectors). `tmout` doesn't have to be passed (the default is 25 sec), but if it is, it should be adjusted for the other arguments and your own device and browser speeds - on my browser, `tmout=0.01` is enough to load user details, but loading the first tweets takes at least `tmout=2`.
I wrote `scrapeTwitterProfile` mostly so that I could get `tuDets` [below] in one line. The for-loop after that is just for printing the results.
tuDets = [scrapeTwitterProfile(url) for url in profileUrls]
for url, d in zip(profileUrls, tuDets):
    print('\nFrom', url)
    for k, v in d.items(): print(f'\t{k}: {miniStr(v)}')
snscrape has a module for Twitter that can be used to access Twitter data without having registered for the API yourself. The example below prints a similar output to the previous example, but is faster.
import snscrape.modules.twitter as sns_twitter

# def miniStr(o): return ' '.join(w for w in str(o).split() if w)
# profileIDs = [url.split('twitter.com/', 1)[-1].split('/')[0] for url in profileUrls]
profileIDs = ['twitterID', 'jokowi', 'sep_colin']
keysList = ['username', 'id', 'displayname', 'description', 'url']
for pid in profileIDs:
    tusRes, defVal = sns_twitter.TwitterUserScraper(pid).entity, 'no such attribute'
    print('\nfor ID', pid)
    for k in keysList: print('\t', k, ':', miniStr(getattr(tusRes, k, defVal)))
You can get most of the attributes in `.entity` with `.__dict__`, or print them all with something like

print('\n'.join(f'{a}: {miniStr(v)}' for a, v in [
    (n, getattr(tusRes, n)) for n in dir(tusRes)
] if not (a[:1] == '_' or callable(v))))
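That idiom can be sanity-checked on any object - here with a made-up stand-in class, since scraping the real entity requires a network call:

```python
def miniStr(o): return ' '.join(w for w in str(o).split() if w)

# Made-up stand-in for a snscrape User entity (the real one needs a network call)
class FakeUser:
    def __init__(self):
        self.username = 'jane'
        self.followersCount = 42
        self._private = 'internal'
    def helper(self): pass

tusRes = FakeUser()
# Dump every public, non-callable attribute, one per line
print('\n'.join(f'{a}: {miniStr(v)}' for a, v in [
    (n, getattr(tusRes, n)) for n in dir(tusRes)
] if not (a[:1] == '_' or callable(v))))
# followersCount: 42
# username: jane
```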
See this example from this tutorial if you are interested in scraping tweets as well.
Upvotes: 1