Segmented

Reputation: 2044

Pandas (Python) reading and working on Java BigInteger/ large numbers

I have a data file (csv) with Nilsimsa hash values. Some of them are as long as 80 characters. I wish to read them in Python for data analysis tasks. Is there a way to import the data in Python without information loss?

EDIT: I have tried the implementations proposed in the comments but they do not work for me. Example data in the csv file would be: 77241756221441762028881402092817125017724447303212139981668021711613168152184106

Upvotes: 3

Views: 2125

Answers (2)

Segmented

Reputation: 2044

As explained by @JohnE in his answer, we do not lose any information when reading big numbers with Pandas. They are stored with dtype=object; to do numerical computation on them, we need to convert the data to a numerical type.

For series:

We apply map(func) to the series in the dataframe:

df['columnName'].map(int)

Whole dataframe:

If, for some reason, our entire dataframe is composed of columns with dtype=object, we can use applymap(func).

from the documentation of Pandas:

DataFrame.applymap(func): Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame

So, to transform all columns in the dataframe:

df.applymap(int)
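The two conversions above can be sketched together. This is a minimal, self-contained example; the column names and the second hash value are made up for illustration, and note that newer pandas versions (2.1+) rename applymap to DataFrame.map:

```python
import pandas as pd
from io import StringIO

# Hypothetical two-column CSV of 80-digit hash values, read as strings
csv = StringIO(
    "a,b\n"
    "77241756221441762028881402092817125017724447303212139981668021711613168152184106,"
    "11111111111111111111111111111111111111111111111111111111111111111111111111111111\n"
)
df = pd.read_csv(csv, dtype=str)

# Single column: Series.map converts each string to a Python int
a_int = df["a"].map(int)

# Whole dataframe: applymap, or DataFrame.map in pandas >= 2.1
if hasattr(pd.DataFrame, "map"):
    df_int = df.map(int)
else:
    df_int = df.applymap(int)

print(df_int.dtypes)                # both columns still dtype=object
print(a_int[0] + df_int["b"][0])    # exact big-integer arithmetic
```

The dtype stays object either way, because pandas has no fixed-width type that can hold 80-digit integers; the values themselves are now Python ints and support exact arithmetic.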

Upvotes: 1

JohnE

Reputation: 30444

Start with a simple text file to read in, just one variable and one row.

%more foo.txt
x
77241756221441762028881402092817125017724447303212139981668021711613168152184106

In [268]: df=pd.read_csv('foo.txt')

Pandas will read it in as a string because it's too big to store as a core number type like int64 or float64. But the info is there, you didn't lose anything.

In [269]: df.x
Out[269]: 
0    7724175622144176202888140209281712501772444730...
Name: x, dtype: object

In [270]: type(df.x[0])
Out[270]: str
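If you want a guarantee that does not depend on pandas' type inference, you can also force the column to be read as a string. A minimal sketch, using an in-memory stand-in for the file and the same column name x as above:

```python
import pandas as pd
from io import StringIO

# Inline stand-in for foo.txt; dtype=str prevents any numeric reinterpretation
raw = "77241756221441762028881402092817125017724447303212139981668021711613168152184106"
df = pd.read_csv(StringIO("x\n" + raw), dtype={"x": str})

# The full 80-digit string survives the round trip untouched
print(df.x[0] == raw)   # True
print(len(df.x[0]))     # 80
```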

And you can use plain Python to treat it as a number. Recall the caveats from the links in the comments: this isn't going to be as fast as operations in numpy and pandas where a whole column is stored as int64. This uses the more flexible but slower object mode to handle things.
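Python's built-in int type has arbitrary precision, so the conversion costs nothing in accuracy; only speed differs from vectorized int64 operations. A standalone illustration with the example value:

```python
# Python ints have arbitrary precision, so an 80-digit value is handled exactly
h = int("77241756221441762028881402092817125017724447303212139981668021711613168152184106")

print(len(str(h)))   # 80 digits, nothing truncated
print(h * 2)         # exact doubling, no overflow or rounding
```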

You can change a column to be stored as Python long integers (arbitrary precision) like this. (But note that the dtype is still object, because everything except the core numpy types (int32, int64, float64, etc.) is stored as an object.)

In [271]: df.x = df.x.map(int)

And then you can more or less treat it like a number.

In [272]: df.x * 2
Out[272]: 
0    1544835124428835240577628041856342500354488946...
Name: x, dtype: object

You'll have to do some formatting to see the whole number. Or go the numpy route which will default to showing the whole number.

In [273]: df.x.values * 2
Out[273]: array([ 154483512442883524057762804185634250035448894606424279963336043423226336304368212L], dtype=object)

Upvotes: 1
