Reputation: 67
Python read CSV to DataFrame, struggling with date columns
Hi all,
I have problems reading in a CSV which looks like this:
col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;-;60;03.02.2020;-;06.07.2020
126;8;-;03.02.2020;04.03.2020;06.07.2020
-;45;-;30.01.2020;29.02.2020;29.06.2020
106;83;189;-;29.02.2020;29.06.2020
-;12;84;30.01.2020;29.02.2020;-
|col_A|col_B|col_C|Col_Date_1|Col_Date_2|Col_Date_3|
|-----|-----|-----|----------|----------|----------|
|57   |-    |60   |03.02.2020|-         |06.07.2020|
|126  |8    |-    |03.02.2020|04.03.2020|06.07.2020|
|-    |45   |-    |30.01.2020|29.02.2020|29.06.2020|
|106  |83   |189  |-         |29.02.2020|29.06.2020|
|-    |12   |84   |30.01.2020|29.02.2020|-         |
Here is how I tried to read in the CSV.
import pandas as pd
df_puma = pd.read_csv("test.csv", sep=";", dayfirst=True,
                      parse_dates=['Col_Date_1', 'Col_Date_2', 'Col_Date_3'],
                      encoding='latin-1')
Unfortunately, neither kind of column ends up with the right type: the first three (integers) and the last three (dates) are all read in as object.
df.info()
----------
col_A 404 non-null object
col_B 404 non-null object
col_C 404 non-null object
Col_Date_1 404 non-null object
Col_Date_2 404 non-null object
Col_Date_3 404 non-null object
Well, I had hoped that at least the date columns would be recognized as dates, but unfortunately not :(. What I would like is:
df.info()
----------
col_A 404 non-null int64
col_B 404 non-null int64
col_C 404 non-null int64
Col_Date_1 404 non-null datetime64[ns]
Col_Date_2 404 non-null datetime64[ns]
Col_Date_3 404 non-null datetime64[ns]
Could someone give me a hint on how to get the data into the right types? In my mind it would look like this:
col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;NaN;60;03.02.2020;NaT;06.07.2020
126;8;NaN;03.02.2020;04.03.2020;06.07.2020
NaN;45;NaN;30.01.2020;29.02.2020;29.06.2020
106;83;189;NaT;29.02.2020;29.06.2020
NaN;12;84;30.01.2020;29.02.2020;NaT
|col_A|col_B|col_C|Col_Date_1|Col_Date_2|Col_Date_3|
|-----|-----|-----|----------|----------|----------|
|57   |NaN  |60   |03.02.2020|NaT       |06.07.2020|
|126  |8    |NaN  |03.02.2020|04.03.2020|06.07.2020|
|NaN  |45   |NaN  |30.01.2020|29.02.2020|29.06.2020|
|106  |83   |189  |NaT       |29.02.2020|29.06.2020|
|NaN  |12   |84   |30.01.2020|29.02.2020|NaT       |
Do I have to iterate through all the columns and rows and clean up the "-" entries? I'm still at quite a newbie level in Python and don't know what the best solution is...
Hope you guys can help me.
Upvotes: 0
Views: 116
Reputation: 14113
Replace your "-" values with NaN and then parse the dates:
from io import StringIO
import pandas as pd
s = """col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;-;60;03.02.2020;-;06.07.2020
126;8;-;03.02.2020;04.03.2020;06.07.2020
-;45;-;30.01.2020;29.02.2020;29.06.2020
106;83;189;-;29.02.2020;29.06.2020
-;12;84;30.01.2020;29.02.2020;-"""
# na_values='-' turns every "-" into NaN, so the numeric columns can be parsed
# as numbers and the date columns (positions 3, 4, 5) as day-first dates.
df = pd.read_csv(StringIO(s), sep=';', na_values='-',
                 parse_dates=[3, 4, 5], dayfirst=True)
col_A col_B col_C Col_Date_1 Col_Date_2 Col_Date_3
0 57.0 NaN 60.0 2020-02-03 NaT 2020-07-06
1 126.0 8.0 NaN 2020-02-03 2020-03-04 2020-07-06
2 NaN 45.0 NaN 2020-01-30 2020-02-29 2020-06-29
3 106.0 83.0 189.0 NaT 2020-02-29 2020-06-29
4 NaN 12.0 84.0 2020-01-30 2020-02-29 NaT
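If you also want col_A, col_B and col_C as integers rather than floats (the NaN values force them to float64 above), one option, assuming a pandas version with nullable integer dtypes, is to convert them afterwards:

# Nullable Int64 keeps the missing values as <NA> instead of falling back to float64.
df[['col_A', 'col_B', 'col_C']] = df[['col_A', 'col_B', 'col_C']].astype('Int64')

The same arguments work when reading your real file, e.g. pd.read_csv('test.csv', sep=';', na_values='-', parse_dates=['Col_Date_1', 'Col_Date_2', 'Col_Date_3'], dayfirst=True, encoding='latin-1').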
Upvotes: 1