Reputation: 23
I want to replicate something similar to the "Multi Row Formula" tool in Alteryx. I am currently reading a csv file and would like to be able to set a value in a list if the value of another row in the same tested column is true.
Sample Data
**Country**
China
India
Brazil
Indonesia
When it loops through the row containing Brazil, "Y" should be appended into the new list since China is two rows above. "N" should be appended for the rest.
import pandas as pd
csv_in = pd.read_csv('C:/sample.csv')
kind = []
for row in csv_in['Country']:
if ***two rows above this row*** == 'China':
kind.append('Y')
elif ***one row below this row*** == 'Canada':
kind.append('Y')
else:
kind.append("N")
csv_in['Result'] = kind
I am having trouble finding anything specific to this issue. Any help would be appreciated!
EDIT: I realized I need to do a little more in addition to what I asked initially.
for row in csv_in['Country']:
if 'hina' in ***two rows above this row***:
kind.append('Y')
elif ***one row below this row***.startswith('Can'):
kind.append('X')
else:
kind.append("N")
Upvotes: 2
Views: 4456
Reputation: 33793
Use shift
to build a Boolean array of where the 'Y'
values should appear, then use numpy.where
to create the column:
import numpy as np
y_cond = (csv_in.shift(2) == 'China') | (csv_in.shift(-1) == 'Canada')
csv_in['Result'] = np.where(y_cond, 'Y', 'N')
If you have more than one column in your DataFrame, you'll need to use csv_in['Country'].shift()
instead of the shorter notation in the code above.
The resulting output on some slightly expanded sample data:
Country Result
0 China N
1 India N
2 Brazil Y
3 Indonesia N
4 Bhutan N
5 Mexico Y
6 Canada N
7 Peru N
8 Honduras N
EDIT:
If you want to assign non-binary values, I'd take a slightly different approach.
Begin by initializing the results as 'N'
. For each condition, create a Boolean array similarly as before, and use loc
to assign the desired value. Do this in reverse order of importance, as subsequent matches will overwrite previous ones.
Note that you can use the .str
accessor to apply string functions to a column, as described in the Working with Text Data section of the documentation.
csv_in['Result'] = 'N'
x_cond = csv_in['Country'].shift(-1).str.startswith('Can').fillna(False)
csv_in.loc[x_cond, 'Result'] = 'X'
y_cond = csv_in['Country'].shift(2).str.contains('hina').fillna(False)
csv_in.loc[y_cond, 'Result'] = 'Y'
The .fillna(False)
is necessary as loc
needs purely Boolean values, and shift
introduces NaN
values. If you really want to write your conditions in order of importance, you could do something like x_cond & (csv_in['Result'] == 'N')
inside of loc
, although it might hinder performance.
Updated output:
Country Result
0 China N
1 India N
2 Brazil Y
3 Indonesia N
4 Bhutan N
5 Mexico X
6 Canada N
7 Peru N
8 Honduras N
Upvotes: 3