jguy

Reputation: 211

Problem in scraping data in non-english character sites [Python]

I am trying to scrape the number of posts within a hashtag. It works perfectly with the following code:

from selenium import webdriver
import bs4 as bs
import pandas as pd
import datetime

driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/hkig')
source = driver.execute_script("return document.body.innerHTML")
soup = bs.BeautifulSoup(source,'lxml')

post = soup.find('span', class_='g47SY ').text
print(post)

However, if I change the tag to one with non-English characters, it crashes. What is the cause and how can I solve it?

The following script gives an error:

from selenium import webdriver
import bs4 as bs
import pandas as pd
import datetime    

driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/モデル')
source = driver.execute_script("return document.body.innerHTML")
soup = bs.BeautifulSoup(source,'lxml')

post = soup.find('span', class_='g47SY ').text
print(post)

EDITED:

The error I got is as follows:

Traceback (most recent call last):
  File "C:/Users/user/Desktop/temp.py", line 12, in <module>
    post = soup.find('span', class_='g47SY ').text
AttributeError: 'NoneType' object has no attribute 'text'

It appears that BeautifulSoup cannot find anything matching 'span', class_='g47SY ', which is why it gives that error. So, back to my question: why is it unable to find it? I checked the post-count element and it is still <span class="g47SY ">6,262,389</span>, so perhaps it is a UTF-8/ASCII encoding issue?
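For illustration, the AttributeError can be reproduced without a browser at all: find() returns None when no matching element exists in the source it was given, so calling .text on the result crashes. A minimal sketch (using simplified markup, the built-in html.parser instead of lxml, and the class name without the trailing space) shows the failure mode and a guard against it:

```python
import bs4 as bs

# Simulate the page source before JavaScript has rendered the span
html_not_loaded = "<body><div>loading...</div></body>"
# Simulate the fully rendered page
html_loaded = '<body><span class="g47SY">6,262,389</span></body>'

def post_count(source):
    soup = bs.BeautifulSoup(source, 'html.parser')
    span = soup.find('span', class_='g47SY')
    # find() returns None when there is no match, so check before .text
    return span.text if span is not None else None

print(post_count(html_not_loaded))  # None -> .text here would raise AttributeError
print(post_count(html_loaded))      # 6,262,389
```

This suggests the problem is not encoding but timing: when the tag page is slower to render (as the non-English tag page apparently is), the span is simply not in the HTML yet at the moment the source is captured.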

Upvotes: 2

Views: 454

Answers (2)

ewwink

Reputation: 19154

You need to wait, using WebDriverWait, until the element with class name g47SY is located, and it is better not to use BeautifulSoup at all when you are already using Selenium.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/モデル')
post = WebDriverWait(driver, 10).until(
    lambda driver: driver.find_element_by_class_name('g47SY')
)
print(post.text)

Upvotes: 2

Dev

Reputation: 2813

Whenever you scrape data with Selenium, consider adding a sleep: in most cases the page takes time to load, so the full page source has not been captured yet. For reference, look at the working code below:

from selenium import webdriver
import bs4 as bs
import pandas as pd
import datetime
import time        #note this line

driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/モデル')
time.sleep(8)                                          #note this as well; it must come after the get() call
source = driver.execute_script("return document.body.innerHTML")
soup = bs.BeautifulSoup(source,'lxml')
print(soup)

post = soup.find('span', class_='g47SY ').text
print(post)
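One trade-off with a fixed sleep is that it either wastes time on fast connections or is still too short on slow ones. The idea behind WebDriverWait can be sketched as a generic polling helper (a plain-Python illustration, not Selenium's implementation; fake_find is a stand-in for any lookup that returns None until the page is ready):

```python
import time

def wait_for(find, timeout=10, interval=0.5):
    """Poll `find` until it returns a non-None result or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('element not found within %s seconds' % timeout)
        time.sleep(interval)

# Stand-in for a page that only "renders" the element on the third poll
attempts = {'n': 0}
def fake_find():
    attempts['n'] += 1
    return '6,262,389' if attempts['n'] >= 3 else None

print(wait_for(fake_find, timeout=5, interval=0.01))  # 6,262,389
```

Unlike time.sleep(8), this returns as soon as the element appears and only fails after the full timeout.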

Upvotes: 2
