Reputation: 168
I want to import a csv dataset. My problem is that when I import it, pandas kind of tries to convert it into something else.
Let me explain with numbers. This is more or less what my csv file looks like:
> Data, Id, Text
>2018-06-11, 17980873.3391, bla bla bla
>2018-06-11, 17980874.4560, bla bla bla
>2018-06-11, 17980876.8560, bla bla bla
The trouble is when I import it with pd.read_csv. The Id column should stay exactly the way it is in the csv file (I want to use it as a filter to do searches). But pandas is returning something like:
When I import it with no changes to the structure, pandas transforms the column into float automatically:
> Data, Id, Text
>2018-06-11, 17980873.33910, bla bla bla
>2018-06-11, 17980874.45600, bla bla bla
>2018-06-11, 17980876.85600, bla bla bla
When I import the dataset and convert the Id column to str:
> Data, Id, Text
>2018-06-11, 17980873.3391, bla bla bla
>2018-06-11, 17980874.456, bla bla bla
>2018-06-11, 17980876.856, bla bla bla
So it is either adding or deleting a zero. I really don't know how to make pandas import the number exactly as it is:
>17980876.8560
Hope I've made myself understood. I'm still learning how to ask questions here.
Thanks!
Upvotes: 2
Views: 427
Reputation: 164783
You should first understand that Pandas isn't reading in your number as the decimal 17980873.33910. It is reading it into your dataframe as a float, which counts in base-2 rather than base-10. Any number you see from then on is just a string representation of that float, nothing more.
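You can see this for yourself by printing the float with more precision than the default (values taken from your sample data):
import pandas as pd

# The trailing zero in "17980874.4560" is not part of the stored value;
# it only exists in the text of the csv file.
s = pd.Series([17980873.3391, 17980874.4560, 17980876.8560])
print(s.iloc[1])            # 17980874.456  -> the trailing zero is gone
print(f"{s.iloc[1]:.20f}")  # 17980874.45600000023841857910 -> the base-2 approximation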
In general, you shouldn't be looking to convert numeric data to strings. The conversion is expensive, any comparisons are expensive, and you end up with a series of pointers to Python objects rather than data held in contiguous memory blocks. That last point is a principal benefit of using Pandas, as it enables vectorised operations.
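As a quick illustration of that difference (again using your sample values), the float version is a single float64 buffer while the string version becomes a series of Python objects:
import pandas as pd

s_float = pd.Series([17980873.3391, 17980874.4560, 17980876.8560])
s_str = s_float.astype(str)

print(s_float.dtype)  # float64 -> contiguous numeric buffer, vectorised ops
print(s_str.dtype)    # object  -> pointers to individual Python strings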
Now to your core problem:
The Id column should be exactly the way it is in csv file (I want to use it as a filter to do searches).
You should use numpy.isclose to compare floats. This function works by setting a tolerance level within which two numbers are deemed to be the same. Here's an example:
import numpy as np
import pandas as pd

s = pd.Series([1.4532400, 67.1234, 54.4556, 765.32414])
res = np.isclose(s, 1.45324)
print(res)
[ True False False False]
Then to filter your series:
s_filtered = s[res]
print(s_filtered)
0    1.45324
dtype: float64
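Applied to a dataframe read from your csv, that might look like the sketch below (the file name data.csv is an assumption based on your sample, and skipinitialspace handles the spaces after the commas):
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)  # Id is read as float64
mask = np.isclose(df['Id'], 17980876.8560)           # tolerance-based match
print(df[mask])                                      # rows whose Id matches the target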
Here's a performance comparison:
s = pd.Series([1.4532400, 67.1234, 54.4556, 765.32414])
s = pd.concat([s]*100000)
s2 = s.astype(str)
%timeit np.isclose(s, 1.45324) # 5.02 ms
%timeit s2 == '1.45324'          # 79.5 ms
Upvotes: 1
Reputation: 177981
Set the dtype for the Id column to str so that no conversion takes place.
Given:
Data,Id,Text
2018-06-11,17980873.3391,bla bla bla
2018-06-11,17980874.4560,bla bla bla
2018-06-11,17980876.8560,bla bla bla
Use:
import pandas as pd
data = pd.read_csv('data.csv', dtype={'Id': str})
print(data)
To get:
         Data             Id         Text
0  2018-06-11  17980873.3391  bla bla bla
1  2018-06-11  17980874.4560  bla bla bla
2  2018-06-11  17980876.8560  bla bla bla
This does assume your ID field is intended to be an 8-digit dot 4-digit string and not a floating point value.
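Continuing from the snippet above, an exact string comparison on Id then behaves the way you want (the filter value is just one of your sample rows):
# Id is now dtype object (str), so the trailing zero is preserved
matches = data[data['Id'] == '17980876.8560']
print(matches)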
Upvotes: 2