NK_
NK_

Reputation: 391

Python: Replace values while reading CSV file

I have a CSV file with several columns that include integers and a string. Naturally, I get a dtype warning because of the mixed dtypes. I read the file with this general command.

df = pd.read_csv(path, sep=";", na_values=missing)

I could use low_memory=False or dtype=object to silence the warning but as far as I know this makes reading my file not more memory efficient.

I could also use na_values="my_string" but I have other missing values (which are supposed to be real missing values) and do not want to mix them.

I do not need the value of the string but only its value count so I thought of replacing it with an integer. Something like this.

df.replace(to_replace="my_string", value=999)

However, is it also possible to replace a value while reading a CSV file? Or does another solution exist? I do not want to simply silence the warning but find a solution which is more memory efficient.

(I know about this answer but it does not really help me with my problem.)

Upvotes: 4

Views: 13864

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

You can use converters:

In [156]: def conv(val, default_val=999):
     ...:     try:
     ...:         return int(val)
     ...:     except ValueError:
     ...:         return default_val
     ...:

In [157]: conv('a')
Out[157]: 999

In [158]: pd.read_csv(r'C:\Temp\test.csv', converters={'a':conv})
Out[158]:
     a   b           c
0    1  11  2000-01-01
1  999  12  2000-01-02
2    3  13  2000-01-02

Another approach, would be to convert columns in a vectorized way after parsing a CSV file:

In [166]: df = pd.read_csv(r'C:\Temp\test.csv', parse_dates=['c'])

In [167]: df
Out[167]:
     a    b          c
0    1  AAA 2000-01-01
1  XXX   12 2000-01-02
2    3   13 2000-01-02

In [168]: df.dtypes
Out[168]:
a            object
b            object
c    datetime64[ns]
dtype: object

In [169]: int_cols = ['a','b']

In [170]: df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').fillna(999).astype(int)

In [171]: df
Out[171]:
     a    b          c
0    1  999 2000-01-01
1  999   12 2000-01-02
2    3   13 2000-01-02

In [172]: df.dtypes
Out[172]:
a             int32
b             int32
c    datetime64[ns]
dtype: object

Speed comparison for 300.000 rows DF:

In [175]: df = pd.concat([df] * 10**5, ignore_index=True)

In [176]: df.shape
Out[176]: (300000, 3)

In [177]: filename = r'C:\Temp\test.csv'

In [184]: df.to_csv(filename, index=False)

In [185]: %%timeit
     ...: df = pd.read_csv(filename, parse_dates=['c'], converters={'a':conv, 'b':conv})
     ...:
632 ms ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [186]: %%timeit
     ...: df = pd.read_csv(filename, parse_dates=['c'])
     ...: df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').fillna(999).astype(int)
     ...:
706 ms ± 60.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 5

David Jimenez
David Jimenez

Reputation: 71

It's no possible to replace de values while you are reading a CSV file. You have to replace once you load the data and save it. Then you don't get warning anymore.

Upvotes: 1

Related Questions