shahid hamdam

Reputation: 821

UTF-8 encoding issues in Python on Windows

I am processing a file in Python on Windows and I keep getting errors such as "UnicodeEncodeError: surrogates not allowed".

Sample text from the document:

Ten states led by Texas Attorney General Ken Paxton (R) filed an antitrust lawsuit against
Google on Wednesday, alleging the tech giant illegally sought to suppress competition and
reap massive profits from targeted advertisements placed across the Web.

The lawsuit — filed in a Texas federal court and backed exclusively by Republicans — strikes
at the heart of Google’s lucrative business in connecting those who seek to buy online ads
with the websites that sell them. Paxton and his GOP allies contend that Google relied on a
mix of improper tactics to force its ad tools on publishers and solidify its pole position
as a “middleman” in the invisible transactions that power much of the Web.

Online advertising is expected to generate $42 billion in revenue this year for Google,
which captures a third of all digital ad spending, according to an October projection from
the firm eMarketer. Google’s vast reach led Texas and other state attorneys general to
conclude in their lawsuit that the tech giant essentially had built the “largest electronic
trading market in existence,” operating ad systems that are not unlike trades on a stock
exchange.

Code1:

return_doc.to_csv(path, index=False)

Error1: UnicodeEncodeError: 'utf-8' codec can't encode character '\udc9d' in position 168: surrogates not allowed

Code2:

return_doc.to_csv(path, index=False, encoding='cp1252')

Error2: UnicodeEncodeError: 'charmap' codec can't encode character '\udc9d' in position 168: character maps to <undefined>

Code3:

return_doc.to_csv(path, index=False, encoding='ISO 8859-15')

Error3: UnicodeEncodeError: 'charmap' codec can't encode character '\u201d' in position 14: character maps to <undefined>

I also tried Code4:

return_doc.to_csv(path, index=False, encoding='cp1252', errors='replace')

This runs without an error, but the following text:

“The actions harm every person in America,” Paxton said in a video statement preceding the
case, which asked a judge to consider “structural” remedies that could theoretically include
forcing a breakup of the company.

is converted into:

“The actions harm every person in America,�? Paxton said in a video statement preceding 
the case, which asked a judge to consider “structural�? remedies that could 
theoretically include forcing a breakup of the company.

I don't want that to happen.

Please suggest a solution that raises no error and leaves the text unchanged.

Upvotes: 1

Views: 1067

Answers (2)

methane

Reputation: 479

On Windows, when standard I/O is attached to a console, Python uses UTF-8 by default. When standard I/O is redirected (e.g. to a file or a pipe), Python uses the ANSI code page encoding instead.
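
A hedged illustration of how an ANSI code page can lead to the error in the question. It assumes the document's UTF-8 bytes were decoded as cp1252 (a common ANSI code page) with the surrogateescape error handler. How the asker's data was actually read is not shown, so that decode step is an assumption. Under it, the third byte of the closing quote ” has no cp1252 mapping and survives as the lone surrogate '\udc9d'.

```python
# Sample bytes: the closing curly quote (U+201D) is E2 80 9D in UTF-8.
utf8_bytes = "America,\u201d Paxton said".encode("utf-8")

# ASSUMPTION: the data was decoded as cp1252 with surrogateescape.
# Byte 0x9D is undefined in cp1252, so it is kept as the surrogate U+DC9D.
misread = utf8_bytes.decode("cp1252", errors="surrogateescape")

# Writing the mis-decoded text as UTF-8 fails the same way as Error1.
try:
    misread.encode("utf-8")
    utf8_error = None
except UnicodeEncodeError as exc:
    utf8_error = exc

# errors='replace' swaps the surrogate for '?', leaving a broken byte
# sequence that shows up as a replacement character plus '?' in UTF-8.
garbled = misread.encode("cp1252", errors="replace").decode("utf-8", errors="replace")

# Decoding the original bytes as UTF-8 keeps the quote intact and writable.
correct = utf8_bytes.decode("utf-8")
correct.encode("utf-8")  # no error
```

If this is what happened, the fix belongs where the text is read (decode it as UTF-8), not where it is written.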

You can enable UTF-8 mode to make UTF-8 the default text encoding. See https://docs.python.org/3/using/windows.html#utf-8-mode for reference.
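
A minimal sketch for checking UTF-8 mode, assuming Python 3.7 or later. It starts a child interpreter with the -X utf8 option (setting the environment variable PYTHONUTF8=1 is the other documented way to enable the mode) and reads back sys.flags.utf8_mode. The helper name utf8_mode_report is made up for this example.

```python
import subprocess
import sys


def utf8_mode_report(extra_args=()):
    """Run a child interpreter and return its UTF-8 mode flag and stdout encoding."""
    probe = "import sys; print(sys.flags.utf8_mode, sys.stdout.encoding.lower())"
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    mode, encoding = result.stdout.split()
    return int(mode), encoding


# Child started with UTF-8 mode switched on explicitly.
forced_mode, forced_encoding = utf8_mode_report(["-X", "utf8"])

# Child started with whatever the current environment implies.
default_mode, default_encoding = utf8_mode_report()
```

Comparing the two results on the affected machine shows whether the script was already running in UTF-8 mode.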

Upvotes: 1

shahid hamdam

Reputation: 821

I did some R&D and found a solution.

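import sys

sys.stdin.reconfigure(encoding='utf-8')

This tells Python which encoding to use for the standard stream on Windows. Note that sys.stdin covers text being read; the equivalent call for printed text is sys.stdout.reconfigure(encoding='utf-8').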
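
A small sketch of what reconfigure changes, assuming Python 3.7 or later. sys.stdin is normally an io.TextIOWrapper, so the example builds one over an in-memory buffer as a stand-in for real standard input. The cp1252 starting state is hypothetical.

```python
import io

# Stand-in for standard input: UTF-8 bytes behind a text wrapper that was
# opened with the cp1252 code page (hypothetical starting state).
raw = io.BytesIO("America,\u201d Paxton said\n".encode("utf-8"))
stream = io.TextIOWrapper(raw, encoding="cp1252")

# Same call as in the answer, applied before any data has been read.
stream.reconfigure(encoding="utf-8")

# The closing quote now decodes to U+201D instead of mis-decoded characters.
line = stream.readline()
```

The encoding can only be changed before data has been read from the stream, so the call should sit at the very start of the script.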

Upvotes: 0
