Pandas - Importing CSV issue - adding and deleting 0 at the end

I want to import a csv dataset. My problem is that when I import it, pandas seems to convert one of the columns into something else.

Let me explain with numbers. This is more or less what my csv file looks like:

> Data, Id, Text

>2018-06-11, 17980873.3391, bla bla bla 

>2018-06-11, 17980874.4560, bla bla bla

>2018-06-11, 17980876.8560, bla bla bla

The trouble starts when I import it with pd.read_csv. The Id column should be exactly the way it is in the csv file (I want to use it as a filter to do searches), but pandas returns something like the following.

When I import with no changes to the structure (pandas converts the column to float automatically):

> Data, Id, Text

>2018-06-11, 17980873.33910, bla bla bla 

>2018-06-11, 17980874.45600, bla bla bla

>2018-06-11, 17980876.85600, bla bla bla

When I import the dataset and cast the Id column to str:

> Data, Id, Text

>2018-06-11, 17980873.3391, bla bla bla 

>2018-06-11, 17980874.456, bla bla bla

>2018-06-11, 17980876.856, bla bla bla

It is deleting and adding zeros. I really don't know how to make pandas import the number exactly as written, e.g.

>17980876.8560

Hope I've made myself understood. I'm still learning how to ask questions here.

Thanks!

Upvotes: 2

Views: 427

Answers (2)

jpp

Reputation: 164783

You should first understand that pandas isn't reading in your number as the decimal 17980873.33910. It reads it into your dataframe as a float, which counts in base-2 rather than base-10. Any number you see thereafter is a string representation of that float, nothing more.
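You can see this directly with the standard-library decimal module (a minimal sketch; Decimal, given a float, shows the exact base-2 value that float actually stores):

```python
from decimal import Decimal

# The text "17980873.3391" has no exact base-2 representation,
# so the float that gets stored is only the nearest representable value.
x = float("17980873.3391")

# Decimal(x) prints the exact stored value, which differs from the
# original text once you look past roughly ten decimal digits.
print(Decimal(x))
```

The trailing "0" you saw appear and disappear is purely an artifact of how that stored float gets rendered back to text.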

In general, you shouldn't be looking to convert numeric data to strings. The conversion is expensive, any comparisons are expensive, and you will be dealing with a series of pointers rather than data held in contiguous memory blocks. That last point is a principal benefit of using pandas, as contiguous data is what enables vectorised operations.

Now to your core problem:

The Id column should be exactly the way it is in the csv file (I want to use it as a filter to do searches).

You should use numpy.isclose to compare floats. This function works by setting a tolerance level, within which two numbers are deemed to be the same. Here's an example:

import numpy as np
import pandas as pd

s = pd.Series([1.4532400, 67.1234, 54.4556, 765.32414])
res = np.isclose(s, 1.45324)

print(res)
[ True False False False]

Then to filter your series:

s_filtered = s[res]

print(s_filtered)

0    1.45324
dtype: float64
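One caveat for your particular data: numpy.isclose uses a relative tolerance of 1e-5 by default, and with Ids around 1.8e7 that allows differences of roughly ±180, so all three of your example Ids would match each other. Pass an explicit absolute tolerance instead. A minimal sketch, reconstructing your dataframe by hand (the name df and the values are assumptions from your question):

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's data, with Id read as float
df = pd.DataFrame({
    "Data": ["2018-06-11", "2018-06-11", "2018-06-11"],
    "Id": [17980873.3391, 17980874.4560, 17980876.8560],
    "Text": ["bla bla bla"] * 3,
})

# rtol=0 disables the relative tolerance; atol=1e-4 is far wider than
# float rounding error but narrower than the gap between distinct Ids.
match = df[np.isclose(df["Id"], 17980876.8560, rtol=0.0, atol=1e-4)]
print(match)
```

This returns only the final row, whereas the default tolerances would return all three.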

Here's a performance comparison:

s = pd.Series([1.4532400, 67.1234, 54.4556, 765.32414])
s = pd.concat([s]*100000)
s2 = s.astype(str)

%timeit np.isclose(s, 1.45324)       # 5.02 ms
%timeit s2.astype(str) == '1.45324'  # 79.5 ms

Upvotes: 1

Mark Tolonen

Reputation: 177981

Set the dtype of the Id column to str so that no conversion is performed.

Given:

Data,Id,Text
2018-06-11,17980873.3391,bla bla bla
2018-06-11,17980874.4560,bla bla bla
2018-06-11,17980876.8560,bla bla bla

Use:

import pandas as pd
data = pd.read_csv('data.csv', dtype={'Id': str})
print(data)

To get:

         Data             Id         Text
0  2018-06-11  17980873.3391  bla bla bla
1  2018-06-11  17980874.4560  bla bla bla
2  2018-06-11  17980876.8560  bla bla bla

This does assume your ID field is intended to be an 8-digit dot 4-digit string and not a floating point value.
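Filtering then becomes an exact string comparison, which preserves the trailing zero. A minimal sketch, using an in-memory csv in place of data.csv:

```python
import io
import pandas as pd

# Same content as data.csv above, held in memory for a self-contained example
csv_text = """Data,Id,Text
2018-06-11,17980873.3391,bla bla bla
2018-06-11,17980874.4560,bla bla bla
2018-06-11,17980876.8560,bla bla bla
"""

data = pd.read_csv(io.StringIO(csv_text), dtype={'Id': str})

# Exact string match; the trailing zero in '17980876.8560' is intact
match = data[data['Id'] == '17980876.8560']
print(match)
```

Since the Ids are never parsed as numbers, there is no rounding and no dropped or added digits.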

Upvotes: 2
