Reputation: 74795

Pandas' `to_csv` doesn't behave the same way as printing

Consider the following sequence of operations:

1. Create a data frame with two columns of types int64 and float64.
2. Create a new frame by converting all columns to object.
3. Inspect the new data frame.
4. Persist the new data frame.
5. Expect the second column to be persisted as it is shown in step 3, i.e. as a string, not as float64.

Illustrated below:

# Step 1
import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})

# Step 2
df2 = df.astype(object)

# Step 3
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       4 non-null      object
 1   b       4 non-null      object
dtypes: object(2)
memory usage: 192.0+ bytes

# NOTE notice how column `b` is rendered
df2
   a       b
0  3       1
1  2  500.43
2  1  256.13
3  0       5

# Step 4
df2.to_csv("/tmp/df2", index=False, sep="\t")

Now let us inspect the generated output:

$ cat /tmp/df2
a   b
3   1.0
2   500.43
1   256.13
0   5.0

Notice how column b is persisted: the decimal places are still present for round numbers even though the datatype is object. Why does this happen? What am I missing here?

I'm using Pandas 1.1.2 with Python 3.7.9.

Upvotes: 2

Answers (2)

Alim

Reputation: 81

I think `object` is a NumPy/pandas dtype, not one of the Python data types: it only says that the column holds arbitrary Python objects. If you run:

type(df2.iloc[0, 1])

before step 4, you will get `float`, even though the column's dtype has already been changed to `object`.
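
A quick way to check this is sketched below. It assumes the same frame as in the question and calls `to_csv` without a path, so the CSV text comes back as a string instead of being written to a file; that is only a convenience for inspection.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)

# The column dtype is object, but each element is still a Python float.
element_types = {type(v).__name__ for v in df2['b']}
print(df2['b'].dtype, element_types)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_lines = df2.to_csv(index=False, sep="\t").splitlines()
print(csv_lines)
```

The round values come out as `1.0` and `5.0` because the writer falls back on the string form of each float object.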

You can use:

df.to_csv("df.csv", float_format='%g', index=False, sep="\t")

instead of casting in step 2.
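
As a sketch of the `float_format` approach, the snippet below writes the original `df` to a string rather than to a file so the result is easy to inspect. One caveat: `%g` keeps only six significant digits by default, so it may not suit values that need more precision. The `%.15g` format is shown purely as an illustrative alternative, not as a recommendation from this answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})

# float_format is applied to the float64 column, so no cast to object is needed.
lines = df.to_csv(float_format='%g', index=False, sep="\t").splitlines()
print(lines)

# '%g' rounds to six significant digits; a wider format keeps more of them.
short = '%g' % 123456.789
wide = '%.15g' % 123456.789
print(short, wide)
```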

Upvotes: 1

Joe Ferndz

Reputation: 8508

I am not great with pandas and am still learning. I looked at a few solutions and thought: why not apply a conversion to the data before sending it to the CSV file?

Here's what I did to get the values printed as 1 and 5 instead of 1.0 and 5.0.

values in df are a mix of strings, floats, and ints

import pandas as pd

df = pd.DataFrame.from_dict({'a': [3, 2, 1, 's', 't'], 'b': [1, 500.43, 256.13, 5, 'txt']})
df2 = df.astype(object)

def coerce(y):
    # Render one value as text: whole numbers lose their ".0",
    # other numbers keep their decimals, everything else goes through str().
    try:
        p = float(y)
        q = int(y)
    except (TypeError, ValueError, OverflowError):
        # Not a number (or NaN/inf): fall back to the plain string form.
        return str(y)
    return str(q) if p == q else str(p)

def convert(x):
    # Apply coerce to every element of a column, keeping the column's index.
    return pd.Series([coerce(i) for i in x.to_list()], index=x.index)

df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")

Output in the file will be:

all values in df are numeric (integers or floats)

import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})  
df2 = df.astype(object)
def convert(x):
    return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])
df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")

The output is as follows:

Here I am assuming all values in df2 are numeric. If it has a string value, then int(i) will fail.
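
A more compact variant that also tolerates strings is sketched here. It is not the code above but a different check, `float.is_integer()`, and `fmt` is a hypothetical helper name chosen for illustration. It only touches real float objects, so numeric-looking strings are left as they are.

```python
import pandas as pd

def fmt(value):
    # Hypothetical helper: drop ".0" from whole-number floats, str() everything else.
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)

# Numeric frame from the question, cast to object as in step 2.
df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)
numeric_lines = df2.apply(lambda col: col.map(fmt)).to_csv(index=False, sep="\t").splitlines()

# Mixed frame with strings, floats and ints.
mixed = pd.DataFrame({'a': [3, 2, 1, 's', 't'], 'b': [1, 500.43, 256.13, 5, 'txt']})
mixed_lines = mixed.apply(lambda col: col.map(fmt)).to_csv(index=False, sep="\t").splitlines()

print(numeric_lines)
print(mixed_lines)
```

`Series.map` is used per column here so the sketch does not depend on which element-wise DataFrame method a given pandas version offers.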

Upvotes: 0
