Francisco Cortes
Francisco Cortes

Reputation: 1226

How to prevent Python/pandas from treating ids like numbers

I have a data set in a csv with some id's that are very long numbers like this:

963839330864351104
426545668232740352
811862613586429056

when I read the csv and convert my dataset into a dataframe, pandas incorrectly thinks it is a number and converts them to scientific notifation so those numbers above become this:

9.638393308643511e+17
4.2654566823274035e+17
8.11862613586429e+17

I tried to resolve this by turning the series to int64 and that works for the most part but here is the problem:

some numbers may look like this orignally

191268415833367997

now, after the number goes from scientific notation to int64 (or int), the number becomes this:

191268415833368000 #some how pandas think that it's ok to round the number up

what I need is to make pandas understand that the column that holds those id's in the dataset is a string column and it should not be treated as number, possibly as of the import so I'm not converting numbers into ints and then later into strings while I format work and transform my dataset to fit my needs.

what I've tried the following but it has not worked:

  1. I've tried converting the series into int and int64
  2. I've tried reading the csv with float_precision='round_trip'
  3. I've tried to convert the series holding the id's into a string
  4. I've tried to set the display differently: pd.options.display.float_format = '{:.0f}'.format

I've seen this question being asked in different ways but there's no definite answer or something I've been able to understand to adapt to my issue. it should be straigt forward but somehow it's not.

help is appreciated it.

Upvotes: 3

Views: 1389

Answers (2)

Francisco Cortes
Francisco Cortes

Reputation: 1226

I found this answer on another question:

Pandas reading csv as string type

this did the trick

pd.read_csv(f, dtype=str)

thank you

Upvotes: 0

Cameron Riddell
Cameron Riddell

Reputation: 13417

Since you mentioned you're loading from csv, you can simply inform pandas you want to treat that column as a string:

from io import StringIO
from pandas import read_csv

data = StringIO('''
id1,id2
963839330864351104,963839330864351104
426545668232740352,426545668232740352
811862613586429056,811862613586429056
191268415833367997,191268415833367997
''')

df = read_csv(data, dtype={'id1': str, 'id2': float})

print(
    df, df.dtypes, sep='\n'*2
)
                  id1           id2
0  963839330864351104  9.638393e+17
1  426545668232740352  4.265457e+17
2  811862613586429056  8.118626e+17
3  191268415833367997  1.912684e+17

id1     object
id2    float64
dtype: object

In this case, read_csv treats 'id1' as string values, and 'id2' as numeric (float) values since it detects that those are floating point numbers.

Upvotes: 6

Related Questions