Guang Gao

Reputation: 13

Selenium webdriver and URIError: "String contained an illegal UTF-16 sequence"

Background: I have only been learning "Webdriver" and "Beautifulsoup" for two days.

Problem: I use the following code to download a webpage:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS(executable_path)
driver.get('https://mojim.com/twy100468x17x18.htm')
pageSource = driver.page_source
...

Then I encountered this error:

WebDriverException: Message: URIError - String contained an illegal UTF-16 sequence.

Try: I tried replacing pageSource = driver.page_source with
(driver.page_source).encode('ascii', 'ignore')
(driver.page_source).encode('utf-8') (as suggested elsewhere)
but I still end up with the same error.

Page Source here

What should I do? Is there illegal text in the HTML, or is it something else?
Thank you

Upvotes: 1

Views: 979

Answers (1)

Alexey Trofimov

Reputation: 5007

I've just overcome this situation. It is caused by various non-UTF characters in the page.

Surprisingly, I solved this with the Edge driver (Chrome and Mozilla don't handle it). So you can use it:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Edge()
driver.get('https://mojim.com/twy100468x17x18.htm')
pageSource = driver.page_source

The catch is that Edge is not headless like PhantomJS, so when scraping I use it only for the bad links that raise this exception. Also, Edge is almost as fast as PhantomJS.
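The "use Edge only for the bad links" approach can be sketched as a small fallback helper. This is a minimal sketch, not a tested recipe for this site: the function name fetch_page_source and its parameters are hypothetical, and it only assumes that each driver object offers get(), page_source and quit(), as Selenium drivers do.

```python
def fetch_page_source(url, driver_factories, error_type=Exception):
    """Return the page source from the first driver that succeeds.

    driver_factories is a list of zero-argument callables, each returning
    a driver-like object (hypothetical parameter, not a Selenium API).
    """
    if not driver_factories:
        raise ValueError("at least one driver factory is required")
    last_error = None
    for make_driver in driver_factories:
        driver = make_driver()
        try:
            driver.get(url)
            # Reading page_source is where the URIError surfaced in the question.
            return driver.page_source
        except error_type as exc:
            last_error = exc  # remember the failure, then try the next driver
        finally:
            driver.quit()  # always release the browser process
    raise last_error


# Possible usage with the drivers from this thread (assumes a Selenium
# version that still ships PhantomJS and a defined executable_path):
# from selenium import webdriver
# from selenium.common.exceptions import WebDriverException
# html = fetch_page_source(
#     'https://mojim.com/twy100468x17x18.htm',
#     [lambda: webdriver.PhantomJS(executable_path), webdriver.Edge],
#     WebDriverException,
# )
```

With this layout the headless driver handles most pages, and the slower visible browser starts only when the first attempt raises.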

Upvotes: 1
