Reputation: 34318
I have a dataframe in pandas which I would like to write to a CSV file.
I am doing this using:
df.to_csv('out.csv')
And getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 20: ordinal not in range(128)
Upvotes: 1154
Views: 2699954
Reputation: 375367
To delimit by a tab you can use the sep argument of to_csv:
df.to_csv(file_name, sep='\t')
To use a specific encoding (e.g. 'utf-8'), use the encoding argument:
df.to_csv(file_name, sep='\t', encoding='utf-8')
In many cases you will want to remove the index and add a header:
df.to_csv(file_name, sep='\t', encoding='utf-8', index=False, header=True)
Upvotes: 1537
Reputation: 51
I would avoid using the '\t' separator, as it can create issues when reading the dataset back in.
df.to_csv(file_name, encoding='utf-8')
Upvotes: 4
Reputation: 23011
errors= is sometimes useful
If a file has to have a certain encoding but the existing dataframe contains characters that cannot be represented in it, errors= can be used to "coerce" the data to be saved anyway, at the cost of losing information. Any value that can be passed as the errors= argument to Python's built-in open() function can be passed here.
For example, the code below saves a csv with ascii encoding where the Japanese characters are replaced with a ?.
df = pd.DataFrame({'A': ['Shohei Ohtani は一生に一度の選手だ。']})
df.to_csv('data1.csv', encoding='ascii', errors='replace', index=False)
print(pd.read_csv('data1.csv'))
A
0 Shohei Ohtani ???????????
float_format= is sometimes useful
You can format float dtypes using float_format=, which can save a lot of disk space, sometimes at the cost of losing precision. For example,
df = pd.DataFrame({'A': [*range(1,9,3)]*1000})/3
df.to_csv('data1.csv', index=False) # 61,440 bytes on disk
df.to_csv('data2.csv', index=False, float_format='%.2f') # 20,480 bytes on disk
Since pandas 1.0.0, you can pass a dict to the compression argument that specifies the compression method and the file name inside the archive. The code below creates a zip file named compressed_data.zip which has a single file in it named data.csv.
df.to_csv('compressed_data.zip', index=False, compression={'method': 'zip', 'archive_name': 'data.csv'})
# read the archived file as a csv
pd.read_csv('compressed_data.zip')
You can even add to an existing archive; simply pass mode='a'
.
df.to_csv('compressed_data.zip', compression={'method': 'zip', 'archive_name': 'data_new.csv'}, mode='a')
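Note that pd.read_csv cannot read a zip archive containing more than one member. As a sketch (reusing the archive names from the example above), you can open a specific member yourself with the standard zipfile module:

```python
import zipfile

import pandas as pd

# Build a zip with two CSV members, as in the mode='a' example above
df = pd.DataFrame({'A': [1, 2, 3]})
df.to_csv('compressed_data.zip', index=False,
          compression={'method': 'zip', 'archive_name': 'data.csv'})
df.to_csv('compressed_data.zip', index=False, mode='a',
          compression={'method': 'zip', 'archive_name': 'data_new.csv'})

# pd.read_csv errors out on a multi-member zip,
# so open the member you want explicitly
with zipfile.ZipFile('compressed_data.zip') as zf:
    with zf.open('data_new.csv') as f:
        df_new = pd.read_csv(f)
```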
Upvotes: 2
Reputation: 677
If the above solution is not working for you, or the CSV is getting messed up, just remove sep='\t' from the line, like this:
df.to_csv(file_name, encoding='utf-8')
Upvotes: 21
Reputation: 15152
Example of exporting to a file with a full path on Windows, in case your file has headers:
df.to_csv(r'C:\Users\John\Desktop\export_dataframe.csv', index=None, header=True)
For example, if you want to store the file in the same directory as your script, with utf-8 encoding and a tab as separator:
df.to_csv(r'./export/dftocsv.csv', sep='\t', encoding='utf-8', header=True)
Upvotes: 40
Reputation: 1650
This may not be the answer for this case, but since I had the same error message with .to_csv, I tried .toCSV('name.csv') and got a different error message ("'SparseDataFrame' object has no attribute 'toCSV'"). So the problem was solved by converting the dataframe to a dense dataframe:
df.to_dense().to_csv("submission.csv", index = False, sep=',', encoding='utf-8')
Upvotes: 11
Reputation: 402253
To write a pandas DataFrame to a CSV file, you will need DataFrame.to_csv
. This function offers many arguments with reasonable defaults that you will more often than not need to override to suit your specific use case. For example, you might want to use a different separator, change the datetime format, or drop the index when writing. to_csv
has arguments you can pass to address these requirements.
Here's a table listing some common scenarios of writing to CSV files and the corresponding arguments you can use for them.

| Scenario | Argument |
|---|---|
| Write with a different separator | sep¹ |
| Drop the index when writing | index=False² |
| Write with a specific encoding | encoding='utf-8'³ |
| Compress the output | compression⁴ |
Footnotes
1. The default separator is assumed to be a comma (','). Don't change this unless you know you need to.
2. By default, the index of df is written as the first column. If your DataFrame does not have an index (in other words, df.index is the default RangeIndex), then you will want to set index=False when writing. To explain this in a different way, if your data DOES have an index, you can (and should) use index=True, or just leave it out completely (as the default is True).
3. It would be wise to set this parameter if you are writing string data, so that other applications know how to read your data. This will also avoid any potential UnicodeEncodeErrors you might encounter while saving.
4. Compression is recommended if you are writing large DataFrames (>100K rows) to disk, as it will result in much smaller output files. On the other hand, the write time will increase (and consequently the read time, since the file will need to be decompressed).
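As a quick sketch of how those arguments translate into calls (file names here are just placeholders):

```python
import pandas as pd

df = pd.DataFrame({'name': ['alpha', 'beta'], 'value': [1, 2]})

# Default comma separator, no index column, explicit utf-8 encoding
df.to_csv('out.csv', index=False, encoding='utf-8')

# Compressed output for large frames; read_csv decompresses transparently
df.to_csv('out.csv.gz', index=False, compression='gzip')
round_trip = pd.read_csv('out.csv.gz')
```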
Upvotes: 62
Reputation: 6095
When you are storing a DataFrame object into a csv file using the to_csv method, you probably won't need to store the preceding indices of each row of the DataFrame object.
You can avoid that by passing a False boolean value to the index parameter.
Somewhat like:
df.to_csv(file_name, encoding='utf-8', index=False)
So if your DataFrame object is something like:
Color Number
0 red 22
1 blue 10
The csv file will store:
Color,Number
red,22
blue,10
instead of (the case when the default value True
was passed)
,Color,Number
0,red,22
1,blue,10
Upvotes: 399
Reputation: 9996
Something else you can try if you are having issues encoding to 'utf-8' and want to go cell by cell:
Python 2
(Where "df" is your DataFrame object.)
for column in df.columns:
    for idx in df[column].index:
        x = df.get_value(idx, column)
        try:
            x = unicode(x.encode('utf-8', 'ignore'), errors='ignore') if type(x) == unicode else unicode(str(x), errors='ignore')
            df.set_value(idx, column, x)
        except Exception:
            print 'encoding error: {0} {1}'.format(idx, column)
            df.set_value(idx, column, '')
            continue
Then try:
df.to_csv(file_name)
You can check the types of the columns by:
for column in df.columns:
    print '{0} {1}'.format(str(type(df[column][0])), str(column))
Warning: errors='ignore' will just omit the character e.g.
IN: unicode('Regenexx\xae',errors='ignore')
OUT: u'Regenexx'
Python 3
for column in df.columns:
    for idx in df[column].index:
        x = df.get_value(idx, column)
        try:
            x = x if type(x) == str else str(x).encode('utf-8', 'ignore').decode('utf-8', 'ignore')
            df.set_value(idx, column, x)
        except Exception:
            print('encoding error: {0} {1}'.format(idx, column))
            df.set_value(idx, column, '')
            continue
Upvotes: 25
Reputation: 186
Sometimes you face these problems even if you specify UTF-8 encoding. I recommend specifying an encoding when reading the file and using the same encoding when writing to it. This might solve your problem.
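A minimal sketch of that round trip (the file name and encoding are just examples):

```python
import pandas as pd

# Write with an explicit encoding...
df = pd.DataFrame({'A': ['\u03b1', '\u03b2']})  # Greek letters, as in the original error
df.to_csv('greek.csv', index=False, encoding='utf-8')

# ...and read back with the same encoding,
# rather than relying on the platform default
df2 = pd.read_csv('greek.csv', encoding='utf-8')
```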
Upvotes: 17