Olivia

Reputation: 111

UnicodeEncodeError: 'charmap' codec can't encode character '\u5347' in position 68: character maps to <undefined>

I am new to Python. I read data from SQL Server and then write it to a CSV file. Each table row contains numbers, strings and datetime values. I tried different ways to write the data. For example,

#method 1
import pandas as pd

df = pd.DataFrame(table, columns=["colummn"])
df.to_csv('list.csv', index=False)

#method 2
import csv

fl = open('OnlineplayDatabase.csv', 'w')
writer = csv.writer(fl)

for row in table:
    writer.writerow(row)

fl.close()

Both methods normally work. But when some rows contain Chinese characters (see the example below), I get an encoding error. The error message says:

codecs.charmap_encode(input,self.errors,encoding_table)[0]

#Error Code   
UnicodeEncodeError: 'charmap' codec can't encode character '\u5347' in position 68: character maps to <undefined>

I tried to encode the fields in each row as UTF-8, but some of the fields are numbers and cannot be encoded that way.

Your help is highly appreciated!

('120.239.9.116  ',
 'gyandroid ',
 4,
 9,
 'Dalvik/1.6.0(Linux;U;Android4.4.2;升级版Build/KVT49L)                                                                      ',
 datetime.datetime(2016, 6, 11, 20, 54, 19),
 datetime.datetime(2016, 6, 11, 20, 56, 53),
 11521.0)

Upvotes: 1

Views: 2866

Answers (2)

Niharika Bitra

Reputation: 477

Try this for method #2:

#method 2
import csv

# set the encoding to utf8; newline='' is what the csv module expects for files it writes
with open('OnlineplayDatabase.csv', 'w', encoding='utf8', newline='') as fl:
    writer = csv.writer(fl)
    for row in table:
        writer.writerow(row)

Also take a look at this - http://www.pgbovine.net/unicode-python-errors.htm
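
To see why the explicit encoding matters, here is a minimal sketch, assuming the default codec on the asker's machine was a Windows code page such as cp1252 (the traceback only says 'charmap'). The row values and the temp-file path are made up for illustration.

```python
import csv
import os
import tempfile

# Illustrative row modeled on the question's data (not the real table).
row = ("gyandroid", 4, "Android4.4.2;升级版Build/KVT49L", 11521.0)

# 1) The failure: cp1252 is a 'charmap' codec with no mapping for these characters.
try:
    "升级版".encode("cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

# 2) The fix: choose the encoding explicitly when opening the file.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding="utf-8", newline="") as fl:
    csv.writer(fl).writerow(row)  # csv stringifies the numbers itself

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8", newline="") as fl:
    read_back = next(csv.reader(fl))
```

Note that the numeric fields need no manual encoding: only the file object deals with bytes, so mixed rows can be passed to writerow as they are.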

Upvotes: 3

Clock Slave

Reputation: 7967

Look at the error again. It happens because somewhere in your dataframe there are strings containing characters that the output codec cannot encode; they show up as \u escapes (here '\u5347') in the error message. You need to get rid of them. See if this works. Use the remove_u function below to get rid of the \u escapes.

def remove_u(word):
    word_u = (word.encode('unicode-escape')).decode("utf-8", "strict")
    if r'\u' in word_u: 
        # print(True)
        return word_u.split('\\u')[1]
    return word

df.loc[:, 'colummn'] = df['colummn'].apply(func = remove_u)

Once you have updated the dataframe, try writing it out again.
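
For the write itself, a minimal sketch, assuming pandas is installed, the same 'colummn' column name as above, and a throwaway temp path. Passing the encoding explicitly makes the choice visible instead of relying on a default; 'utf-8-sig' is one option that prepends a byte order mark, which some spreadsheet programs use to detect UTF-8. With a UTF-8 encoding the Chinese characters can be written as they are, so it may be worth trying this step before removing anything.

```python
import os
import tempfile

import pandas as pd

# Illustrative data; the real dataframe comes from the SQL Server table.
df = pd.DataFrame({"colummn": ["升级版Build", "plain"]})

path = os.path.join(tempfile.mkdtemp(), "list.csv")

# Write with an explicit encoding rather than an implicit default.
df.to_csv(path, index=False, encoding="utf-8-sig")

# Reading back with the same encoding should return the original strings.
round_trip = pd.read_csv(path, encoding="utf-8-sig")
```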

EDIT

I am assuming your column is composed of individual words. If it holds multi-word strings instead, use this modified version of remove_u:

def remove_u(input_string):
    words = input_string.split()
    words_u = [(word.encode('unicode-escape')).decode("utf-8", "strict") for word in words]
    words_u = [word_u.split('\\u')[1] if r'\u' in word_u else word_u for word_u in words_u]
    # print(words_u)
    return ' '.join(words_u)
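
If dropping the characters is acceptable, the errors argument of str.encode is a shorter route than going through unicode-escape. This is only a sketch: the helper name strip_unencodable is hypothetical, and cp1252 is an assumed target codec, since the question does not say which code page was in use.

```python
def strip_unencodable(value, encoding="cp1252"):
    """Drop characters that `encoding` cannot represent.

    Non-string values (numbers, datetimes) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    # errors="ignore" silently skips anything the codec has no mapping for.
    return value.encode(encoding, errors="ignore").decode(encoding)


cleaned = strip_unencodable("Android4.4.2;升级版Build/KVT49L")
untouched_number = strip_unencodable(4)
```

It can be applied the same way as remove_u, for example df['colummn'].apply(strip_unencodable). Keep in mind that this discards data, whereas writing the file as UTF-8 keeps it.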

Upvotes: 0
