Reputation: 1568
I need to create a new column indicator
that takes the value of code
. I cannot use if else statements, because code
can have a value and the same value somewhere else in the column of second code
. Any help would be appreciated.
import pandas as pd
df = pd.DataFrame({
'date': ['2019-04-19','2019-04-20','2019-05-03', '2019-05-04',
'2019-10-01','2019-10-07','2019-10-11', '2019-11-20'],
'category': ['ID F', 'ID F', 'ID F', 'ID F',
'ID B', 'ID B', 'ID B', 'ID B'],
'code': ['None', 'None', 'None', '8008',
'None', 'None', '9001', 'None'],
'second_code': ['None', 'None', '8008', 'None',
'None', 'None', 'None', 'None']})
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
date category code second_code
0 2019-04-19 ID F None None
1 2019-04-20 ID F None None
2 2019-05-03 ID F None 8008
3 2019-05-04 ID F 8008 None
4 2019-10-01 ID B None None
5 2019-10-07 ID B None None
6 2019-10-11 ID B 9001 None
7 2019-11-20 ID B None None
Desired DataFrame
date category code second_code indicator
0 2019-04-19 ID F None None None
1 2019-04-20 ID F None None None
2 2019-05-03 ID F None 8008 8008
3 2019-05-04 ID F 8008 None None
4 2019-10-01 ID B None None None
5 2019-10-07 ID B None None None
6 2019-10-11 ID B 9001 None 9001
7 2019-11-20 ID B None None None
Edit:
I have tried:
df['indicator'] = np.where(df['new_code'].notnull(), df['new_code'], np.where(df['code']).notnull(), df['code'], np.nan)
but am getting the error AttributeError: 'tuple' object has no attribute 'notna'
Upvotes: 0
Views: 223
Reputation: 66
Try this one.
import pandas as pd
df = pd.DataFrame({
'date': ['2019-04-19','2019-04-20','2019-05-03', '2019-05-04',
'2019-10-01','2019-10-07','2019-10-11', '2019-11-20'],
'category': ['ID F', 'ID F', 'ID F', 'ID F',
'ID B', 'ID B', 'ID B', 'ID B'],
'code': ['None', 'None', 'None', '8008',
'None', 'None', '9001', 'None'],
'second_code': ['None', 'None', '8008', 'None',
'None', 'None', 'None', 'None']})
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df['indicator'] = df['code']
print(df)
output.
date category code second_code indicator
0 2019-04-19 ID F None None None
1 2019-04-20 ID F None None None
2 2019-05-03 ID F None 8008 None
3 2019-05-04 ID F 8008 None 8008
4 2019-10-01 ID B None None None
5 2019-10-07 ID B None None None
6 2019-10-11 ID B 9001 None 9001
7 2019-11-20 ID B None None None
Read more here https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html, the documentation has everything.
Upvotes: 0
Reputation: 13537
Easiest way should be to use the fillna
function along with the drop_duplicates
to ensure we get the first time a specific code shows up in the column.
# First we need to replace the "None" with actual NaN values
df = df.replace("None", np.nan)
df["indicator"] = df["second_code"].fillna(df["code"]).drop_duplicates()
df
date category code second_code indicator
0 2019-04-19 ID F NaN NaN NaN
1 2019-04-20 ID F NaN NaN NaN
2 2019-05-03 ID F NaN 8008 8008
3 2019-05-04 ID F 8008 NaN NaN
4 2019-10-01 ID B NaN NaN NaN
5 2019-10-07 ID B NaN NaN NaN
6 2019-10-11 ID B 9001 NaN 9001
7 2019-11-20 ID B NaN NaN NaN
Essentially I'm telling pandas to: take the column "second_code," and fill in any missing values (NaN) in "second_code" with the values from "code". Then, get rid of any potential duplicate entries from the aforementioned operation and assign this result to the "indicator" column.
Upvotes: 1