Valerio Storch

Reputation: 301

Pandas read_csv dtype inference on file with many int columns, except index and columns are string

I need to load a big .csv file (something like 10 million records) for the recommender I am building. My input file looks like this (with k around 400 columns):

      P1    P2    ... Pk

a      1     1    ...  0
b      0     0    ...  0
c      0     0    ...  1

I try to read my file with this call:

pd.read_csv(url, header=0, sep="\t", index_col=0, encoding="utf-8")

When I read the file, pandas incorrectly guesses that all the numbers in my data are floats. I want to force the data to be of type int in order to save memory during loading, so I tried the dtype=int option:
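
pd.read_csv(url, header=0, sep="\t", index_col=0, encoding="utf-8", dtype=int)

but this raised an error: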

ValueError: invalid literal for int() with base 10: 'a'

I guess this is because my index and column names are strings.

I know that I could use a dictionary to specify the data types of the columns manually, but since I am building a recommender I don't know the columns and indexes of my files in advance, and I want to avoid re-creating the dictionary each time a new file is loaded.

How can I tell read_csv to apply the integer type only to the data of my table, and not to the index or the column names?

Upvotes: 5

Views: 4033

Answers (2)

normanius

Reputation: 9762

Approach 1: In case you have only a few columns with non-default data types, you could use a defaultdict:

import pandas as pd
from collections import defaultdict

dtypes = defaultdict(lambda: int)        # int is the default dtype
dtypes["index_column"] = str             # explicit overrides
dtypes["other_special_column"] = object
# ...
df = pd.read_csv(path, dtype=dtypes, ...)

How this works: dtypes["something"] returns int by default, except for the columns that were specified explicitly beforehand.
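
For instance, with the dtypes mapping above:

dtypes["index_column"]   # str  (set explicitly)
dtypes["P1"]             # int  (the default for any key not set before)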

Approach 2: In case the dtype can be inferred safely by reading only a part of the .csv, you could do the following:

n = 1000
df = pd.read_csv(path, nrows=n, ...)
df = pd.read_csv(path, dtype=df.dtypes.to_dict(), ...)
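
Applied to the call from the question, this would look roughly as follows (a sketch, assuming the first 1000 rows are representative of the whole file):

import pandas as pd

# Infer dtypes from a small sample, then reuse them for the full read.
# With index_col=0 the string index is excluded from df.dtypes, so the
# data columns come back as integers.
sample = pd.read_csv(url, header=0, sep="\t", index_col=0,
                     encoding="utf-8", nrows=1000)
df = pd.read_csv(url, header=0, sep="\t", index_col=0,
                 encoding="utf-8", dtype=sample.dtypes.to_dict())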

Upvotes: 1

Phung Duy Phong

Reputation: 896

Method 1) Use apply() on the dataframe with a function that coerces each value to int where possible:

df = pd.read_csv(url, header=0, sep="\t", index_col=0, encoding="utf-8")

def check_to_int(x):
    # Return the value as an int when possible, otherwise unchanged.
    try:
        return int(x)
    except (ValueError, TypeError):
        return x

for i in df.columns:
    df[i] = df[i].apply(check_to_int)
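
As a side note (an alternative, not part of the original answer): pd.to_numeric performs the same error-safe coercion in a vectorized way, which should be considerably faster on 10 million rows:

import pandas as pd

# df as loaded above; try a vectorized numeric conversion per column.
for col in df.columns:
    try:
        # downcast="integer" picks the smallest integer dtype that fits
        df[col] = pd.to_numeric(df[col], downcast="integer")
    except (ValueError, TypeError):
        pass  # non-numeric column: leave it unchanged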

If you have any further problems with the data types, please post them.

Method 2) Dynamically read the first rows of your dataframe to detect which columns are int/float (since you don't know your csv column names in advance), then create a dtype dict with those names.

For example, if I had the dataframe:

    |user_id    |screen_name    |isocode    |location_name   |location_prob
0   |1058941868 |scottspur      |           |                |
1   |1058941921 |Roxy22Bennett  |           |                |
2   |105894357  |MerrynPreece   |GB         |United Kingdom  |0.998043

So I check row 2, the first row where every column is populated:

a = pd.read_csv('Result_Phong1.csv', header=0, encoding="utf-8", nrows=3)
a.fillna('', inplace=True)

# Collect the columns whose value in row 2 is a float
temp = []
for i in a.columns:
    if isinstance(a.loc[2, i], float):
        temp.append(i)

and the result would be:

Out[46]: [u'location_prob']

Then you can build a dict from them to pass to read_csv's dtype parameter.
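
For example (a sketch of that last step; it assumes the temp list from above and lets pandas infer the remaining columns):

dtypes = {col: float for col in temp}
df = pd.read_csv('Result_Phong1.csv', header=0, encoding="utf-8", dtype=dtypes)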

Upvotes: 0
