Reputation: 3
I need to do a large number of row-level operations (a few pages of code) on a table of data. E.g.:
if row.Col_A == 'X': row.Col_B = 'Y'
I believe iterrows isn't appropriate for altering table values, so I've converted the table to a list of DotMap dictionaries. With this I can loop over the list, write the code as above for each dictionary (row), and the alterations are saved.
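A simplified sketch of what I'm doing now (the sample data is illustrative):
import pandas as pd
from dotmap import DotMap

df = pd.DataFrame({'Col_A': ['X', 'Q'], 'Col_B': ['?', '?']})

rows = [DotMap(rec) for rec in df.to_dict('records')]  # table -> list of dot-accessible dicts
for row in rows:
    if row.Col_A == 'X':
        row.Col_B = 'Y'
df = pd.DataFrame([r.toDict() for r in rows])  # back to a dataframe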
Is it possible to do this with the data as a DataFrame?
There is a lot of logic and I think it's clearest written this way, so I'd prefer not to use map or apply functions.
Upvotes: 0
Views: 1156
Reputation: 449
Let's use the following example dataframe:
import pandas as pd
import numpy as np
some_data = pd.DataFrame({
'col_a': [1, 2, 1, 2, 3, 4, 3, 4],
'col_b': ['a', 'b', 'c', 'c', 'a', 'b', 'z', 'z']
})
We want to create a new column based on one (or more) of the existing columns' values.
If you have only two options, I would suggest using numpy.where like this:
some_data['np_where_example'] = np.where(some_data.col_a < 3, 'less_than_3', 'greater_than_3')
print(some_data)
>>>
   col_a col_b np_where_example
0      1     a      less_than_3
1      2     b      less_than_3
2      1     c      less_than_3
3      2     c      less_than_3
4      3     a   greater_than_3
5      4     b   greater_than_3
6      3     z   greater_than_3
7      4     z   greater_than_3
# multiple conditions
some_data['np_where_multiple_conditions'] = np.where(
    (some_data.col_a >= 3) & (some_data.col_b == 'z'),
    'is_true',
    'is_false')
print(some_data[['col_a', 'col_b', 'np_where_multiple_conditions']])
>>>
   col_a col_b np_where_multiple_conditions
0      1     a                     is_false
1      2     b                     is_false
2      1     c                     is_false
3      2     c                     is_false
4      3     a                     is_false
5      4     b                     is_false
6      3     z                      is_true
7      4     z                      is_true
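If there are more than two condition-driven outcomes, numpy.select extends the same idea; here is a sketch reusing the example columns (the 'exactly_3' label is illustrative):
conditions = [
    some_data.col_a < 3,
    some_data.col_a == 3,
]
choices = ['less_than_3', 'exactly_3']
some_data['np_select_example'] = np.select(conditions, choices, default='greater_than_3')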
If you have many options, Series.map is a better fit:
some_data['map_example'] = some_data.col_b.map({
'b': 'BBB',
'z': 'ZZZ'
})
print(some_data[['col_a', 'col_b', 'map_example']])
>>>
   col_a col_b map_example
0      1     a         NaN
1      2     b         BBB
2      1     c         NaN
3      2     c         NaN
4      3     a         NaN
5      4     b         BBB
6      3     z         ZZZ
7      4     z         ZZZ
As you can see, with map the values for which no mapping is specified evaluate to NaN.
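If you want a default value instead of NaN, chaining fillna after map works (a small sketch; 'OTHER' is an arbitrary placeholder):
some_data['map_with_default'] = some_data.col_b.map({'b': 'BBB', 'z': 'ZZZ'}).fillna('OTHER')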
Upvotes: 1
Reputation: 148880
A possible way to iterate over a dataframe by rows and change column values is:
- make sure that there are no duplicated values in the index (if there are, just use reset_index to get an acceptable one)
- iterate over the index and access the individual values with at
for ix in df.index:
    if df.at[ix, 'A'] == ...:
        df.at[ix, 'B'] = z
Alternatively, if you can access the columns by their positions instead of their names, you can use the even more efficient iat:
for i in range(len(df)):
    if df.iat[i, index_col_A] == ...:
        df.iat[i, index_col_B] = z
Because you access the individual elements directly, you avoid the overhead of iterrows creating a Series per row, and the changes are written back to the dataframe. AFAIK, this is the least bad way when you cannot use vectorized pandas or numpy methods.
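For reference, a runnable sketch of both patterns on the question's columns (Col_A, Col_B and the 'X' -> 'Y' rule come from the question; the sample data is illustrative):
import pandas as pd

df = pd.DataFrame({'Col_A': ['X', 'Q', 'X'], 'Col_B': ['?', '?', '?']})
df = df.reset_index(drop=True)  # make sure the index is unique

# at variant: look up by index label and column name
for ix in df.index:
    if df.at[ix, 'Col_A'] == 'X':
        df.at[ix, 'Col_B'] = 'Y'

# iat variant: resolve the column positions once, outside the loop
index_col_A = df.columns.get_loc('Col_A')
index_col_B = df.columns.get_loc('Col_B')
for i in range(len(df)):
    if df.iat[i, index_col_A] == 'X':
        df.iat[i, index_col_B] = 'Y'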
Upvotes: 0
Reputation: 867
You can use the apply function with a lambda in the following way:
df['Col_B'] = df['Col_A'].apply(lambda a: 'Y' if a == 'X' else 'N')
This creates the column Col_B on the dataframe df by looking at Col_A and assigning 'Y' where Col_A is 'X' and 'N' otherwise.
If your function is a bit more complex, you can define it beforehand and pass it to apply as follows:
def yes_or_no(x):
    if x == 'X':
        return 'Y'
    else:
        return 'N'

df['Col_B'] = df['Col_A'].apply(yes_or_no)
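If the logic depends on more than one column, the same pattern works row-wise with axis=1. A sketch (derive_col_b is an illustrative name; the rule mirrors the question's example):
def derive_col_b(row):
    # the function receives the whole row as a Series, so it can
    # combine several columns in one place
    if row['Col_A'] == 'X':
        return 'Y'
    return 'N'

df['Col_B'] = df.apply(derive_col_b, axis=1)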
Upvotes: 0