Reputation:
Trying to convert data (string format from a csv file) to boolean (in a dataframe), I have 'lost' information on their original value, so now all the values are boolean False
.
The columns that I am trying to change into boolean are the following:
df['Col1'] =df['Col1'].astype('bool')
df['Col2'] =df['Col2'].astype('bool')
I have also tried with
df.Col1 = np.where(df.Col1.eq('true'), True, False)
df. Col2 = np.where(df.Col2.eq('true') | df.Col2.eq('tbc'), True, False)
The unique values for each column, Col1 and Col2 are:
Col1: array([true, false, nan], dtype=object)
Col2: array(['true', 'false', 'tbc', nan], dtype=object)
My original dataset has the following values.
Col1 Col2
true true
true true
false false
nan false
false true
true tbc
Though they were converted into boolean, all the values are False:
Col1 Col2
False False
False False
False False
False False
False False
False False
I would like to treat TBC as True. Why am I getting only False values? Any idea on how I could fix it?
Sample of original dataset and code:
Date Checked Verified
2018-05-23 FALSE TRUE
2018-05-24 TRUE TBC
2018-05-26 FALSE TBC
2018-05-31 nan nan
2019-12-01 TRUE TRUE
2019-12-05 TRUE TBC
2019-12-15 TRUE FALSE
2019-12-23 FALSE nan
Code
df=pd.read_csv(path, sep=';', engine='python')
df= df.apply(lambda x: x.astype(str).str.lower())
df['Checked'] = np.where(df['Checked'].eq('true'), True, False)
df['Verified'] = np.where(df['Verified'].eq('true') | df['Verified'].eq('tbc'), True, False)
Then I test how many rows have value Checked = True:
len(df[df['Checked']=='true'])
output: 153
Convert to boolean Checked:
df['Checked'] = np.where(df['Checked'].eq('true'), True, False)
len(df[df['Checked']==True])
output: 153
Convert Verified
to boolean:
df['Verified'] = np.where(df['Verified'].eq('true') | df['Verified'].eq('tbc'), True, False)
len(df[df['Verified']==True])
output: 0
(expected 60
)
Upvotes: 1
Views: 422
Reputation: 831
For me, this works perfectly. Since you haven't given the exact csv format, I assumed it would be like this:
Date;Checked;Verified
2018-05-23;FALSE;TRUE
2018-05-24;TRUE;TBC
2018-05-26;FALSE;TBC
2018-05-31;nan;nan
2019-12-01;TRUE;TRUE
2019-12-05;TRUE;TBC
2019-12-15;TRUE;FALSE
2019-12-23;FALSE;nan
Then I called the code, like you did:
df=pd.read_csv(path, sep=';', engine='python')
df=df.apply(lambda x: x.astype(str).str.lower())
df['Checked'] = np.where(df['Checked'].eq('true'), True, False)
df['Verified'] = np.where(df['Verified'].eq('true') | df['Verified'].eq('tbc'), True, False)
The resulting dataframe looks like this:
Date Checked Verified
0 2018-05-23 False True
1 2018-05-24 True True
2 2018-05-26 False True
3 2018-05-31 False False
4 2019-12-01 True True
5 2019-12-05 True True
6 2019-12-15 True False
7 2019-12-23 False False
Which python and pandas version do you use? I tested this with Python 3.6.1 and Pandas 1.04
Upvotes: 0
Reputation: 402303
You can do this for both columns via regular expressions and df.replace
:
df.astype(str).replace({'(?i)True|TBC': True, '(?i)False|nan': False}, regex=True)
Col1 Col2
0 True True
1 True True
2 False False
3 False False
4 False True
5 True True
The pattern is case insensitive.
Upvotes: 1