Reputation: 517
I am working with a DataFrame and using NumPy to transform the data, including setting blanks (or '') to NaN. But when I write the DataFrame to CSV, the output contains the string 'nan' as opposed to being NULL.
I have looked around but can't find a workable solution. Here's the basic issue:
df
index    x    y    z
0        1  NaN    2
1      NaN    3    4
CSV output:
index,x,y,z
0,1,nan,2
1,nan,3,4
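For what it's worth, here is a minimal sketch of one way this can happen (I'm assuming the blanks were converted with something like np.where on a string array, which coerces np.nan to the text 'nan'):
import numpy as np
import pandas as pd

# hypothetical repro: arr has a string dtype, so np.where promotes np.nan
# to the literal string 'nan' instead of keeping a real missing value
arr = np.array([['1', '', '2'], ['', '3', '4']])
arr = np.where(arr == '', np.nan, arr)
df = pd.DataFrame(arr, columns=['x', 'y', 'z'])
df.to_csv('demo.csv')  # the file now contains the text 'nan'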
I have tried a few things to set 'nan' to NULL, but the CSV output results in a blank rather than NULL:
dfDemographics = dfDemographics.replace('nan', np.NaN)
dfDemographics.replace(r'\s+( +\.)|#', np.nan, regex=True).replace('', np.nan)
dfDemographics = dfDemographics.replace('nan', '') # of course, this wouldn't work, but tried it anyway.
Any help would be appreciated.
Upvotes: 39
Views: 56965
Reputation: 11
I think someone posted this already, but you turned it down because it writes the literal string 'NULL' into the file:
df.to_csv('file.csv', na_rep='NULL')
Now I understand you want an actual blank value there. CSVs are text files, and the blank is just the empty string '', so the solution is to use that instead:
df.to_csv('file.csv', na_rep='')
This works for me. Some other solutions may still be valid, but this is by far the simplest and most straightforward. If this were a fixed-width file, you would have to pad with spaces to the field's width, but for CSV you don't need the spaces, just the empty string.
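If you want to sanity-check the round trip: read_csv parses empty fields back to NaN by default, so the blanks survive as missing values (a quick check, assuming the same df as above):
df.to_csv('file.csv', na_rep='')
check = pd.read_csv('file.csv')
print(check.isna().sum())  # the blank fields come back as NaN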
Upvotes: 1
Reputation: 1116
Based on gherka's and Kranthi Kiran's answers:
df = df.replace("nan", '', regex=True)
df.to_csv("df.csv", index=False)
Upvotes: 1
Reputation: 1446
In my situation, the culprit was np.where. When the data types of the two return elements differ, your np.NaN will be converted to the string 'nan'.
It's hard (for me) to see exactly what's going on under the hood, but I suspect this might be true for other NumPy array methods that mix types.
A minimal example:
import numpy as np
import pandas as pd
seq = [1, 2, 3, 4, np.NaN]
same_type_seq = np.where("parrot"=="dead", 0, seq)
diff_type_seq = np.where("parrot"=="dead", "spam", seq)
pd.Series(seq).to_csv("vanilla_nan.csv", header=False) # as expected, last row is blank
pd.Series(same_type_seq).to_csv("samey_nan.csv", header=False) # also, blank
pd.Series(diff_type_seq).to_csv("nany_nan.csv", header=False) # nan instead of blank
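One way to see the coercion is to inspect the dtypes that np.where produced:
print(same_type_seq.dtype)   # float64 -- np.NaN survives
print(diff_type_seq.dtype)   # <U32, a string dtype -- np.NaN has become the text 'nan'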
So how do you get round this? I'm not too sure, but as a hacky workaround for small datasets, you can replace NaN in your original sequence with a token string and then replace it back with np.NaN:
repl = "missing"
hacky_seq = np.where("parrot"=="dead", "spam", [repl if np.isnan(x) else x for x in seq])
pd.Series(hacky_seq).replace({repl:np.NaN}).to_csv("hacky_nan.csv", header=False)
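Another option, assuming an object-dtype array is acceptable for your data, is to stop NumPy from promoting to a string dtype in the first place:
# object dtype defeats the string promotion, so np.nan survives as a real NaN
obj_seq = np.where("parrot"=="dead", "spam", np.array(seq, dtype=object))
pd.Series(obj_seq).to_csv("object_nan.csv", header=False)  # last row is blank again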
Upvotes: 1
Reputation: 1316
User @coldspeed illustrates how to replace NaN values with NULL when saving a pd.DataFrame. If, for data analysis, one is instead interested in replacing the "NULL" values in a pd.DataFrame with np.NaN, the following code will do it:
import numpy as np, pandas as pd

# replace null values (None/NaN) with np.nan
colNames = mydf.columns.tolist()
dfVals = mydf.values              # pull out the underlying ndarray
matSyb = mydf.isnull().values     # boolean mask marking the null positions
dfVals[matSyb] = np.NAN           # overwrite the nulls in place
mydf = pd.DataFrame(dfVals, columns=colNames)
#np.nansum(mydf.values, axis=0)
#np.nansum(dfVals, axis=0)
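For what it's worth, a shorter equivalent (a sketch, assuming a reasonably recent pandas) is DataFrame.where, which keeps values where the mask is True and fills everything else with NaN:
mydf = mydf.where(mydf.notnull())  # None and other nulls all become np.nan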
Upvotes: 0
Reputation: 402363
Pandas to the rescue: use na_rep to set your own representation for NaNs.
df.to_csv('file.csv', na_rep='NULL')
file.csv
,index,x,y,z
0,0,1.0,NULL,2
1,1,NULL,3.0,4
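One caveat: na_rep only applies to genuine NaN values. If a column already holds the literal string 'nan' (which is what the question's transform produced), convert it back to real NaN first:
df = df.replace('nan', np.nan)        # literal 'nan' strings -> real NaN
df.to_csv('file.csv', na_rep='NULL')  # now written as NULL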
Upvotes: 72
Reputation: 128
Using df.replace may help:
df = df.replace(np.nan, '', regex=True)
df.to_csv("df.csv", index=False)
(This sets all the null values to '', i.e. the empty string.)
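An equivalent without the regex machinery:
df.fillna('').to_csv("df.csv", index=False)  # NaN -> empty string, then write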
Upvotes: 4