mac

Reputation: 930

How to get twitter profile name using python BeautifulSoup module?

I'm trying to get a Twitter profile name from a profile URL with BeautifulSoup in Python, but whatever HTML tags I use, I'm not able to get the name. What HTML tags can I use to get the profile name from a Twitter user page?

import requests
from bs4 import BeautifulSoup

url = 'https://twitter.com/twitterID'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Find the display name
name_element = soup.find('span', {'class': 'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
if name_element is not None:
    display_name = name_element.text
else:
    display_name = "error"

Upvotes: 0

Views: 722

Answers (1)

Driftr95

Reputation: 4710

html = requests.get(url).text

Twitter profile pages cannot be scraped simply through requests like this, because their contents are loaded with JavaScript [via the API], as you might notice if you previewed your browser's network logs or checked the fetched HTML.
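
As a quick check (a minimal sketch, assuming only requests and bs4 are installed), you can fetch the page and look for the markup that the selectors below rely on - it won't be in the static response:

import requests
from bs4 import BeautifulSoup

# the static response is a JavaScript shell, not the rendered profile
html = requests.get('https://twitter.com/twitterID').text
soup = BeautifulSoup(html, 'html.parser')

print(soup.title)  # a generic shell title (or None), not the profile name
print('data-testid="UserName"' in html)  # expected False - added later by JS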


name_element = soup.find('span', {'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
display_name = name_element.text

Even after fetching the right HTML, calling .find like that will leave display_name containing 'To view keyboard shortcuts, press question mark' or 'Don’t miss what’s happening', because there are 67 span tags with that class and .find only returns the first match. Calling .find_all(....)[6] might work, but it's definitely not a reliable approach. You should instead consider using .select with CSS selectors to target the name.

name_element = soup.select_one('div[data-testid="UserName"] span>span')

The .find equivalent would be

# name_element = soup.find('div', {'data-testid': 'UserName'}).span.span ## too many weak points
name_element = soup.find(lambda t: t.name == t.parent.name == 'span' and t.find_parent('div', {'data-testid': 'UserName'}))

but I find .select much more convenient.
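
Whichever you use, guard against a missed match before reading .text (a small sketch following the question's own error handling):

name_element = soup.select_one('div[data-testid="UserName"] span>span')
display_name = name_element.get_text(strip=True) if name_element else "error"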


Selenium Example

This uses two functions I often use for scraping: linkToSoup_selenium (which takes a URL and returns a BeautifulSoup object after using selenium and bs4 to load and parse the HTML), and selectForList (which extracts details from bs4 Tags based on selectors, like those in the selectors dictionary below).

Setup:

# imports ## PASTE FROM https://pastebin.com/kEC9gPC8
# def linkToSoup_selenium... ## PASTE FROM https://pastebin.com/kEC9gPC8
# def selectForList... ## PASTE FROM https://pastebin.com/ZnZ7xM6u 

## JUST FOR REDUCING WHITESPACE - not important for extracting information ##
def miniStr(o): return ' '.join(w for w in str(o).split() if w)

profileUrls = ['https://twitter.com/twitterID', 'https://twitter.com/jokowi', 'https://twitter.com/sep_colin']

# ptSel = 'article[data-testid="tweet"]:has(div[data-testid="socialContext"])'
# ptuaSel = 'div[data-testid="User-Names"]>div>div>div>a'
selectors = {
    'og_url': ('meta[property="og\:url"][content]', 'content'),
    'name_span': 'div[data-testid="UserName"] span>span',
    'name_div': 'div[data-testid="UserName"]',
    # 'handle': 'div[data-testid="UserName"]>div>div>div+div',
    'description': 'div[data-testid="UserDescription"]',
    # 'location': 'span[data-testid="UserLocation"]>span',
    # 'url_href': ('a[data-testid="UserUrl"][href]', 'href'),
    # 'url_text': 'a[data-testid="UserUrl"]>span',
    # 'birthday': 'span[data-testid="UserBirthdate"]',
    # 'joined': 'span[data-testid="UserJoinDate"]>span',
    # 'following': 'div[data-testid="UserName"]~div>div>a[href$="\/following"]',
    # 'followers': 'div[data-testid="UserName"]~div>div>a[href$="\/followers"]',
    # 'pinnedTweet_uname': f'{ptSel} div[data-testid="User-Names"] span>span',
    # 'pinnedTweet_handl': f'{ptSel} {ptuaSel}:not([aria-label])',
    # 'pinnedTweet_pDate': (f'{ptSel} {ptuaSel}[aria-label]', 'aria-label'),
    # 'pinnedTweet_text': f'{ptSel} div[data-testid="tweetText"]',
}


def scrapeTwitterProfile(profileUrl, selRef=selectors):
    soup = linkToSoup_selenium(profileUrl, ecx=[
        'div[data-testid="UserDescription"]'  # wait for user description to load
        # 'article[data-testid="tweet"]'  # wait for tweets to load
    ], tmout=3, by_method='css', returnErr=True)
    if not isinstance(soup, str): return selectForList(soup, selRef)
    return {'Error': f'failed to scrape {profileUrl} - {soup}'}

Setting returnErr=True returns the error message (a string instead of the BeautifulSoup object) if anything goes wrong. ecx should be set based on which part(s) you want to load (it's a list, so it can hold multiple selectors). tmout doesn't have to be passed (the default is 25 seconds), but if it is, it should be adjusted for the other arguments and for your own device and browser speeds - on my browser, tmout=0.01 is enough to load user details, but loading the first tweets takes at least tmout=2.
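
For example (a sketch reusing the same pasted helpers; the generous tmout here is just a cautious guess), waiting for both the bio and the first tweets before parsing:

# wait for both the user description and the first tweets to load
soup = linkToSoup_selenium('https://twitter.com/jokowi', ecx=[
    'div[data-testid="UserDescription"]',
    'article[data-testid="tweet"]',
], tmout=25, by_method='css', returnErr=True)

if isinstance(soup, str):  # with returnErr=True, errors come back as strings
    print('scrape failed:', soup)
else:
    print(selectForList(soup, selectors))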

I wrote scrapeTwitterProfile mostly so that I could get tuDets [below] in one line. The for-loop after that is just for printing the results.

tuDets = [scrapeTwitterProfile(url) for url in profileUrls]
for url, d in zip(profileUrls, tuDets):
    print('\nFrom', url)
    for k, v in d.items(): print(f'\t{k}: {miniStr(v)}')

snscrape Example

snscrape has a module for Twitter that can be used to access Twitter data without having registered for the API yourself. The example below prints similar output to the previous example, but is faster.

# import snscrape.modules.twitter as sns_twitter
# def miniStr(o): return ' '.join(w for w in str(o).split() if w)

# profileIDs = [url.split('twitter.com/', 1)[-1].split('/')[0] for url in profileUrls]
profileIDs = ['twitterID', 'jokowi', 'sep_colin']
keysList = ['username', 'id', 'displayname', 'description', 'url']

for pid in profileIDs:
    tusRes, defVal = sns_twitter.TwitterUserScraper(pid).entity, 'no such attribute'
    print('\nfor ID', pid)
    for k in keysList: print('\t', k, ':', miniStr(getattr(tusRes, k, defVal)))

You can get most of the attributes in .entity with .__dict__, or print them all with something like

print('\n'.join(f'{a}: {miniStr(v)}' for a, v in [
    (n, getattr(tusRes, n)) for n in dir(tusRes)
] if not (a[:1] == '_' or callable(v))))

See this example from this tutorial if you are interested in scraping tweets as well.

Upvotes: 1
