Reputation: 301
I need to load a big .csv file (with something like 10 million records) for a recommender system I am building. My input file looks like this (with k around 400 columns):
   P1 P2 ... Pk
a  1  1  ... 0
b  0  0  ... 0
c  0  0  ... 1
I try to read my file with this call:
pd.read_csv(url, header=0, sep="\t", index_col=0, encoding="utf-8")
When I read the file, pandas incorrectly guesses that all the numbers in my data are floats. I want to force the data to be of type int in order to save memory during loading. I tried to use the option dtype=int, but this raised an error:
ValueError: invalid literal for int() with base 10: 'a'
I guess this is because my index and column names are strings.
I know that I could use a dictionary to specify the data types for the columns manually, but since I am building a recommender I don't know the columns and indexes of my files in advance, and I want to avoid re-creating the dictionary each time a new file is loaded.
How can I tell the read_csv method to set the integer type only on the data of my table, and not on the index and the column names?
Upvotes: 5
Views: 4033
Reputation: 9762
Approach 1: In case you have only a few columns with non-default data types, you could use a defaultdict:
from collections import defaultdict
dtypes = defaultdict(lambda: int)
dtypes["index_column"] = str
dtypes["other_special_column"] = object
# ...
df = pd.read_csv(path, dtype=dtypes, ...)
How this works: dtypes["something"] returns int by default, except for the columns that were specified beforehand.
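To see the fallback behaviour in isolation, here is a quick standalone check (plain Python, nothing pandas-specific):
from collections import defaultdict

dtypes = defaultdict(lambda: int)   # missing keys fall back to int
dtypes["index_column"] = str        # explicit override

print(dtypes["index_column"])   # <class 'str'>  -- set beforehand
print(dtypes["P1"])             # <class 'int'>  -- supplied by the default factory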
Approach 2: In case the dtype can be inferred safely by reading only a part of the .csv, you could do the following:
n = 1000
df = pd.read_csv(path, nrows=n, ...)          # read only the first n rows and let pandas infer the dtypes
df = pd.read_csv(path, dtype=df.dtypes, ...)  # re-read the whole file with those dtypes
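One caveat (my note, not part of the original answer): df.dtypes is a pandas Series, while the documented form of the dtype argument is a dict of column name to type, so converting it explicitly is a safer sketch:
n = 1000
sample = pd.read_csv(path, nrows=n)     # infer dtypes from a sample
dtype_map = sample.dtypes.to_dict()     # plain dict of column -> dtype
df = pd.read_csv(path, dtype=dtype_map) # re-read the full file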
Upvotes: 1
Reputation: 896
You can use apply() on the dataframe with a function that does an error-safe coercion to int where it can:
df = pd.read_csv(url, header=0, sep="\t", index_col=0, encoding="utf-8")
def check_to_int(x):
    # Convert to int when possible; otherwise return the value unchanged.
    try:
        return int(x)
    except (ValueError, TypeError):
        return x

for i in df.columns:
    df[i] = df[i].apply(check_to_int)
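As a side note (my addition, not from the original answer): on 10 million rows an element-wise apply() is slow. pandas' built-in pd.to_numeric does a similar error-tolerant coercion in vectorized form, with the difference that errors="ignore" leaves a whole column unchanged if any single value fails to parse:
import pandas as pd

for col in df.columns:
    # Vectorized coercion; downcast="integer" picks the smallest integer
    # dtype when every value in the column can be represented as an int.
    df[col] = pd.to_numeric(df[col], errors="ignore", downcast="integer")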
If you have any further problems with the data types (which is likely), please post.
Alternatively, you can check which columns were read as float and create a dict of dtypes with those names. For example, if I had the dataframe:
  | user_id    | screen_name   | isocode | location_name  | location_prob
0 | 1058941868 | scottspur     |         |                |
1 | 1058941921 | Roxy22Bennett |         |                |
2 | 105894357  | MerrynPreece  | GB      | United Kingdom | 0.998043
So I must check row '2', the first row where every column is populated:
a = pd.read_csv('Result_Phong1.csv', header=0, encoding="utf-8", nrows=3)
a.fillna('', inplace=True)  # replace NaN with empty strings so only real floats remain
temp = []
for i in a.loc[2, :].index:
    if type(a.loc[2, :][i]) == float:  # collect the columns whose value is a float
        temp.append(i)
and the result would be:
Out[46]: [u'location_prob']
Then you can create a dict from them to pass to the read_csv function.
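A minimal sketch of that last step, assuming temp holds the detected float column names as above:
# Map only the detected float columns; every other column keeps
# pandas' default inference.
dtype_map = {col: float for col in temp}
df = pd.read_csv('Result_Phong1.csv', header=0, encoding="utf-8",
                 dtype=dtype_map)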
Upvotes: 0