Reputation: 211
I am trying to scrape the number of posts under a hashtag. It works perfectly with the following code:
from selenium import webdriver
import bs4 as bs
import pandas as pd
import datetime
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/hkig')
source = driver.execute_script("return document.body.innerHTML")
soup = bs.BeautifulSoup(source,'lxml')
post = soup.find('span', class_='g47SY ').text
print(post)
However, if I change the tag to non-English characters, it crashes. What is the cause, and how can I solve it?
The following script raises an error:
from selenium import webdriver
import bs4 as bs
import pandas as pd
import datetime
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/モデル')
source = driver.execute_script("return document.body.innerHTML")
soup = bs.BeautifulSoup(source,'lxml')
post = soup.find('span', class_='g47SY ').text
print(post)
EDITED:
The error I got is as follows:
Traceback (most recent call last):
File "C:/Users/user/Desktop/temp.py", line 12, in <module>
post = soup.find('span', class_='g47SY ').text
AttributeError: 'NoneType' object has no attribute 'text'
It appears that BeautifulSoup cannot find anything matching 'span', class_='g47SY ', so soup.find returns None and the .text access raises this error. So, back to my question: why is it unable to find the element? I checked the post-count element in the browser, and it is still <span class="g47SY ">6,262,389</span>. Perhaps it is a UTF-8/ASCII encoding issue?
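One way to test the encoding guess would be to percent-encode the tag before passing it to driver.get, so that the URL contains only ASCII characters. The following is a minimal sketch using the standard library's urllib.parse.quote; tag_url and BASE are hypothetical names introduced for illustration and are not part of Selenium. If the same AttributeError appears with the encoded URL, encoding is probably not the cause.

```python
from urllib.parse import quote

# Base URL taken from the scripts above
BASE = "https://www.instagram.com/explore/tags/"

def tag_url(tag):
    """Build the hashtag URL with the tag percent-encoded as UTF-8 (hypothetical helper)."""
    # safe="" encodes every reserved or non-ASCII character in the tag
    return BASE + quote(tag, safe="")

print(tag_url("hkig"))    # ASCII tags pass through unchanged
print(tag_url("モデル"))  # non-ASCII tags become %XX escape sequences
```

The resulting string can be passed to driver.get in place of the literal URL.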
Upvotes: 2
Views: 454
Reputation: 19154
You need to wait, using WebDriverWait, until the element with class name g47SY is located. It is also better not to use BeautifulSoup when you are already using Selenium.
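from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/モデル')
# Poll for up to 10 seconds until the post-count element is present
post = WebDriverWait(driver, 10).until(
lambda driver: driver.find_element(By.CLASS_NAME, 'g47SY')
)
print(post.text)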
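To illustrate what such a wait does, here is a rough pure-Python sketch of the polling idea: call a condition repeatedly until it returns something truthy or a timeout expires. This is not Selenium's implementation; wait_until and fake_find_post_count are hypothetical names, and the fake lookup stands in for a page that finishes rendering after a few polls. The real WebDriverWait additionally ignores NoSuchElementException between polls, which this sketch simplifies to a None return value.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)

# Hypothetical stand-in for an element lookup: returns None on the first
# two calls (page still loading), then the post count text.
attempts = {"count": 0}

def fake_find_post_count():
    attempts["count"] += 1
    return "6,262,389" if attempts["count"] >= 3 else None

print(wait_until(fake_find_post_count, timeout=2.0, poll=0.01))
```

Compared with a fixed delay, polling returns as soon as the element shows up and fails with a clear timeout error if it never does.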
Upvotes: 2
Reputation: 2813
Whenever you scrape data with Selenium, consider adding a sleep.
In most cases the page takes time to load, so the source you read right after get is incomplete. For reference, see the working code below:
from selenium import webdriver
import bs4 as bs
import pandas as pd
import datetime
import time #note this line
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/モデル')
time.sleep(8) #note this as well moreover it should be after get method
source = driver.execute_script("return document.body.innerHTML")
soup = bs.BeautifulSoup(source,'lxml')
print(soup)
post = soup.find('span', class_='g47SY ').text
print(post)
Upvotes: 2