Reputation: 737
I have a csv file with the following sample rows:
3/12/1970
3/1/1942
10/20/1945 10/20/1945
10/27/1960
10/5/1952
I bring it into pandas with df = pd.read_csv(filename).
I know there are rows with double dates as noted above. The dtype of this column is object in pandas. When I try to convert this column to datetime format, I get errors on all the rows with this double date issue and have to find and edit them in the csv one by one. So, I have tried the following to clean the double dates out of my 50K rows:
df[col] = df[col].str.strip()
df[col] = df[col].str[:10]
Neither of these affects the double dates at all.
I also tried calculating the length of each value in the column and removing values whose length exceeds 10 (roughly as sketched below). Still, the double date rows remain.
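Roughly what I ran for that (a sketch; col stands for the real column name):
# drop rows whose value is longer than a normal m/d/yyyy date string
mask = df[col].str.len() > 10
df = df[~mask]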
I have also tried the following just to locate this particular row and inspect it further, but this code returns 0 rows.
bad_dates = df[df[col].str.contains('10/20/1945')]
So, any creative ideas to clean these double dates? (It happens with probably a hundred randomly distributed values in the column.)
Upvotes: 0
Views: 162
Reputation: 101
You can use split() to do this. split() splits each row (a string) into a list of values on whitespace, and [-1] then selects the last value only; this eliminates the extra values and retains only the single value you need.
df['col'].apply(lambda x: x.split()[-1])
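A vectorized variant of the same idea, plus the datetime conversion that was the original goal (a sketch; the %m/%d/%Y format and errors='coerce' are assumptions, the latter turning anything still malformed into NaT rather than raising):
# split on whitespace and keep the last token of each value
df['col'] = df['col'].str.split().str[-1]
# convert, assuming month/day/year order; malformed values become NaT
df['col'] = pd.to_datetime(df['col'], format='%m/%d/%Y', errors='coerce')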
Upvotes: 1
Reputation: 2696
With the test file
col
3/12/1970
3/1/1942
10/20/1945 10/20/1945
10/27/1960
10/5/1952
saved as ~/project/test/test.csv
import pandas as pd
df = pd.read_csv('~/project/test/test.csv')
gives
col
0 3/12/1970
1 3/1/1942
2 10/20/1945 10/20/1945
3 10/27/1960
4 10/5/1952
Then your example
df['fixed'] = df['col'].str[:10]
gives
col fixed
0 3/12/1970 3/12/1970
1 3/1/1942 3/1/1942
2 10/20/1945 10/20/1945 10/20/1945
3 10/27/1960 10/27/1960
4 10/5/1952 10/5/1952
Basically this worked, so something about your case must be different; see the diagnostic sketch below.
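One way to see what actually differs is to print the raw representation of the over-long values; a minimal sketch (hidden characters such as non-breaking spaces are a common culprit):
# look at the exact characters in any value longer than 10
mask = df['col'].str.len().fillna(0) > 10
for val in df.loc[mask, 'col']:
    print(repr(val), len(val))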
Upvotes: 0