Reputation: 13
Background: I have only been using Selenium WebDriver and BeautifulSoup for two days.
Problem: I use the following code to download a webpage:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.PhantomJS(executable_path)
driver.get('https://mojim.com/twy100468x17x18.htm')
pageSource = driver.page_source
...
Then I get this error:
WebDriverException: Message: URIError - String contained an illegal UTF-16 sequence.
What I tried: I replaced pageSource = driver.page_source
with each of
(driver.page_source).encode('ascii', 'ignore')
(driver.page_source).encode('utf-8')
(as suggested here)
but I still end up with the same error.
Page Source here
What should I do? Is there illegal text in the HTML, or is it something else?
Thank you
Upvotes: 1
Views: 979
Reputation: 5007
I've just run into this myself. It is caused by characters in the page that are not valid UTF.
Surprisingly, I solved it with the Edge driver (the Chrome and Mozilla drivers don't handle it). So you can use:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Edge()
driver.get('https://mojim.com/twy100468x17x18.htm')
pageSource = driver.page_source
The catch is that Edge is not headless like PhantomJS, so when scraping I use it only for the bad links that raise this exception. Also, Edge is almost as fast as PhantomJS.
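A rough sketch of that "Edge only for the bad links" idea, in case it helps. The helper name fetch_page_source and its parameters are made up for illustration and are not part of Selenium; it only assumes the drivers have get(), page_source and quit(), which Selenium drivers do.

```python
def fetch_page_source(url, primary, fallback_factory, error_type=Exception):
    """Return the page source for url.

    primary          -- an already-started driver (e.g. the headless one)
    fallback_factory -- a callable that creates the fallback driver,
                        only called if the primary driver fails
    error_type       -- exception class that triggers the fallback
                        (hypothetical parameter, defaults to Exception)
    """
    try:
        primary.get(url)
        return primary.page_source
    except error_type:
        # The primary driver choked on this page, so start the
        # fallback driver just for this one URL.
        fallback = fallback_factory()
        try:
            fallback.get(url)
            return fallback.page_source
        finally:
            # Always close the fallback browser window again.
            fallback.quit()
```

With Selenium this could be called roughly as fetch_page_source(url, driver, webdriver.Edge, error_type=WebDriverException), with WebDriverException imported from selenium.common.exceptions and driver being the PhantomJS driver from the question. That way the visible Edge window only opens for the pages PhantomJS cannot handle.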
Upvotes: 1