Manoj Govindan

Reputation: 74705

Pandas' `to_csv` doesn't behave the same way as printing

Consider the following sequence of operations:

  1. Create a data frame with two columns of types int64 and float64
  2. Create a new frame by converting all columns to object
  3. Inspect the new data frame
  4. Persist the new data frame
  5. Expect the second column to get persisted as shown in the 3rd step: i.e. as string, not as float64

Illustrated below:

# Step 1
import pandas as pd

df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})

# Step 2
df2 = df.astype(object)

# Step 3
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       4 non-null      object
 1   b       4 non-null      object
dtypes: object(2)
memory usage: 192.0+ bytes

# NOTE notice how column `b` is rendered
df2
   a       b
0  3       1
1  2  500.43
2  1  256.13
3  0       5

# Step 4
df2.to_csv("/tmp/df2", index=False,  sep="\t")

Now let us inspect the generated output:

$ cat /tmp/df2
a   b
3   1.0
2   500.43
1   256.13
0   5.0

Notice how column b is persisted: the decimal places are still present for round numbers even though the datatype is object. Why does this happen? What am I missing here?

I'm using Pandas 1.1.2 with Python 3.7.9.

Upvotes: 2

Views: 800

Answers (2)

Alim

Reputation: 81

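I think `object` is a NumPy/pandas dtype rather than a Python data type: it only says that the column holds arbitrary Python objects, and it does not turn the values into strings. If you run: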

type(df2.iloc[0,1])

before step 4, you will get `float`, even though the column's dtype has already been changed to `object`.
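
To make this concrete, here is a minimal sketch using the data frame from the question. It checks the element types after the cast and captures what `to_csv` writes; the variable names `element_types` and `csv_text` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)

# The column dtype is object, but every element is still a Python float
element_types = {type(v).__name__ for v in df2['b']}
print(element_types)

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = df2.to_csv(index=False, sep="\t")
print(csv_text)
```

Because the elements are floats, whole numbers such as 1.0 and 5.0 are written with their decimal part, which matches the file shown in the question.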

You can use:

df.to_csv("df.csv", float_format='%g', index=False, sep="\t")

instead of casting in step 2.
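
A small sketch of what this produces for the data in the question, written to a string rather than a file so it can be inspected directly. One caveat worth keeping in mind: `%g` keeps only 6 significant digits by default, so larger values switch to scientific notation. The `'%.15g'` variant shown at the end is my own suggestion for that case, not something the question requires.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})

# float_format is applied to the float64 column only; the int column is untouched
csv_text = df.to_csv(index=False, sep="\t", float_format='%g')
print(csv_text)

# Caveat: '%g' rounds to 6 significant digits
short = '%g' % 1234567.89      # scientific notation, precision lost
longer = '%.15g' % 1234567.89  # keeps the full value for typical data
print(short, longer)
```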

Upvotes: 1

Joe Ferndz

Reputation: 8508

I am not great with pandas and am still learning. I looked at a few solutions and thought, why not apply a conversion to the data before sending it to the CSV file?

Here's what I did to get the values printed as 1 and 5 instead of 1.0 and 5.0.

When the values in df are a mix of strings, floats, and ints:

import pandas as pd

df = pd.DataFrame.from_dict({'a': [3, 2, 1, 's', 't'], 'b': [1, 500.43, 256.13, 5, 'txt']})
df2 = df.astype(object)

def coerce(y):
    # Render whole numbers without a decimal part; everything else keeps its string form
    try:
        p = float(y)
        q = int(y)
    except (TypeError, ValueError, OverflowError):
        # Not numeric (e.g. 's' or 'txt'), or not representable as an int (NaN, inf)
        return str(y)
    return str(q) if p == q else str(p)

def convert(x):
    # Apply coerce to every value in the column, keeping the original index
    return pd.Series([coerce(i) for i in x.to_list()], index=x.index)

df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")

Output in the file will be:

a   b
3   1
2   500.43
1   256.13
s   5
t   txt

When all values in df are numeric (integers or floats):

import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})  
df2 = df.astype(object)
def convert(x):
    return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])
df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")

The output is as follows:

a   b
3   1
2   500.43
1   256.13
0   5

Here I am assuming all values in df2 are numeric. If a column contains a non-numeric string, int(i) raises a ValueError.
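
If you want a single version that tolerates non-numeric values without a try/except, one possible sketch is to convert only values that are floats holding a whole number and leave everything else alone. The helper name `whole_floats_to_str` is made up for this example, and I have only reasoned through it for small frames like the one in the question.

```python
import pandas as pd

def whole_floats_to_str(value):
    # Only floats that hold a whole number are rewritten, e.g. 5.0 -> '5'.
    # NaN and inf are floats too, but is_integer() is False for them.
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return value

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})

# Map the helper over every column, then capture the CSV text
cleaned = df.apply(lambda col: col.map(whole_floats_to_str))
csv_text = cleaned.to_csv(index=False, sep="\t")
print(csv_text)
```

Returning a string rather than an int is deliberate: a column containing only ints and floats would be inferred as float64 again, which brings the `.0` back.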

Upvotes: 0
