Reputation: 35
Starting DataFrame
df = pd.DataFrame({'Column A' : ['red','green','yellow', 'orange', 'red', 'blue'],
'Column B' : [NaN, 'blue', 'purple', NaN, NaN, NaN],
'Column C' : [1, 2, 3, 2, 3, 7]})
Column A | Column B | Column C |
---|---|---|
'red' | NaN | 1 |
'green' | 'blue' | 2 |
'yellow' | 'purple' | 3 |
'orange' | NaN | 2 |
'red' | NaN | 3 |
'blue' | NaN | 7 |
Desired Result
Column A | Column B | Column C |
---|---|---|
'red' | NaN | 1 |
'blue' | 'blue' | 2 |
'purple' | 'purple' | 3 |
'orange' | NaN | 2 |
'red' | NaN | 3 |
'blue' | NaN | 7 |
I want to replace values in column A only if the value in Column B is not NaN, and to replace column A with the value in Column B
So that I can run the following code:
df[[Column_A, Column_C]].groupby(Column_A).sum()
Which would result in the following DataFrame:
Column A | Column C |
---|---|
'red' | 4 |
'blue' | 9 |
'purple' | 3 |
'orange' | 2 |
I am trying to replace categories before doing a groupby
call.
Attempts:
The DataFrame I am working with has a sequential numerical based index going from 0 to N.
So I could hard code the following:
df.iloc[[index], column] = some_string
I do not want to do this as it is not dynamic and the DataFrame data could change.
I believe I could use .agg()
or .apply()
on either the df
or the df.groupby()
but this is where I have struggled.
Particularly with how to write a function to use with .agg()
or .apply()
Say:
def my_func(x):
print(x)
Then:
df.apply(my_func)
The result is the first column of df
printed.
Or:
df.apply(my_func, axis = 1)
The result is the following format for each row:
Column A red
Column B Nan
Column C 1
Name: 0, dtype: object
Column A green
Column B blue
Column C 2
Name: 1, dtype: object
I am not sure how to access each column per row in my_func
.
Edit:
I am trying to find a way to change the value in Column A if the value, for that row, in Column B is not NaN. The value to use for replacing is the value in Column B, the value to replace is the value in Column A if Column B is not NaN.
But I want to do this dynamically, meaning not hardcoded as I showed with:
df.iloc[[index], column] = some_string
Upvotes: 1
Views: 1149
Reputation: 895
As you mentioned, you could use pd.apply
like this:
df['Column A'] = df.apply(lambda x: x['Column B'] if str(x['Column B']) not in ['nan', 'NaN'] else x['Column A'], axis=1)
Column A Column B Column C
0 red NaN 1
1 blue blue 2
2 purple purple 3
3 orange NaN 2
4 red NaN 3
5 blue NaN 7
Notice that apply is not fast at for very large dataset is not advisable. There are some good answers out there for alternative methods
Upvotes: 1