Reputation: 753

How to add a new column to a Pandas DataFrame based on data in each row and on the existence of other rows that meet specific criteria?

I'm looking to take my existing DF with a number of columns and perform the following operation:

For each originalRow in the DF, check whether another row exists where:
    row.col1 == originalRow.col1
    and
    row.col2 != originalRow.col2

Is there a way to do this gracefully in Python / Pandas?

I've looked into using .where, but the issue I run into is that my conditions compare one row's column values against another row's values. (Something like this answer: https://stackoverflow.com/a/43481338/3757782, but across two different rows.)

Something like:

df["New Col"] = np.where((df["col1"] == df["col1*"]) & (df["col2"] != df["col2*"]), 1, 0)

Except that col1* has to be another row from col1, etc. Let me know if this question doesn't make sense. I'm hoping that this is something that isn't that hard, and I'm just missing what the standard way to do this is.

Thanks!

Example Data:

df = 
Col1, Col2
a, 1
a, 2
b, 1
b, 1
c, 1
c, 2
c, 3
d, 1
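
For anyone who wants to reproduce this, the example data above could be built as a DataFrame roughly like this (a minimal sketch; the integer dtype for Col2 is an assumption based on the values shown):

```python
import pandas as pd

# Example data from the question: eight rows, two columns
df = pd.DataFrame({
    "Col1": ["a", "a", "b", "b", "c", "c", "c", "d"],
    "Col2": [1, 2, 1, 1, 1, 2, 3, 1],
})
```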

Expected Output:

df =
Col1, Col2, newCol
a, 1, 1
a, 2, 1
b, 1, 0
b, 1, 0
c, 1, 1
c, 2, 1
c, 3, 1
d, 1, 0

The two a rows get 1 (true) because for each of them another row exists where col1 = col1* and col2 != col2*.

The two b rows get 0 (false) because their Col2 values are identical, so the condition is not met.

The three c rows get 1 (true) for the same reason as the a rows.

The d row gets 0 (false) because no other d row exists.
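
To pin the rule down, here is a literal brute-force translation of the condition described above. It is only a sketch for checking results on small data, since it compares every row against the whole frame; the helper name `has_conflicting_row` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["a", "a", "b", "b", "c", "c", "c", "d"],
    "Col2": [1, 2, 1, 1, 1, 2, 3, 1],
})

def has_conflicting_row(row, frame):
    """Return 1 if some row in frame has the same Col1 but a different Col2."""
    same_col1 = frame["Col1"] == row["Col1"]
    different_col2 = frame["Col2"] != row["Col2"]
    return int((same_col1 & different_col2).any())

# A row can never match itself, because its own Col2 is not "different"
df["newCol"] = df.apply(lambda row: has_conflicting_row(row, df), axis=1)
```

This does quadratic work in the number of rows, so it is better suited to validating a faster approach than to production use.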

Upvotes: 0

Answers (2)

wwnde

Reputation: 26676

Let's try

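import numpy as np

# Flag a row when the previous or next row shares its Col1
# and the next row's Col2 differs from its own
df['newCol'] = np.where(
    (df.Col1.eq(df.Col1.shift(-1)) | df.Col1.shift(1).eq(df.Col1))
    & df.Col2.ne(df.Col2.shift(-1)),
    1, 0)

  Col1  Col2  newCol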
0    a     1       1
1    a     2       1
2    b     1       0
3    b     1       0
4    c     1       1
5    c     2       1
6    c     3       1
7    d     1       0
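
Note that a shift-based expression only compares adjacent rows, so it assumes rows with the same Col1 sit next to each other. As a possible order-independent alternative (a sketch, not part of the answer above), one could count the distinct Col2 values per Col1 group: a row has a "conflicting" partner exactly when its group holds more than one distinct Col2 value.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["a", "a", "b", "b", "c", "c", "c", "d"],
    "Col2": [1, 2, 1, 1, 1, 2, 3, 1],
})

# Number of distinct Col2 values within each Col1 group, aligned to the rows
distinct_per_group = df.groupby("Col1")["Col2"].transform("nunique")

# More than one distinct value means another row with a different Col2 exists
df["newCol"] = (distinct_per_group > 1).astype(int)
```

By default `nunique` ignores NaN, so missing Col2 values would need separate handling if they should count as "different".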

Upvotes: 1

tgrandje

Reputation: 2534

How about this?

cols = df.columns.tolist()

# Count how many times each distinct row occurs
count = df.groupby(cols).size().reset_index(name='COUNT')
# Rows that occur more than once get 0
count.loc[count['COUNT'] > 1, 'COUNT'] = 0

df = df.merge(count, on=cols, how='left')

EDIT:

# Column names must match the DataFrame (Col1 / Col2 in the example data)
cols = ['col1', 'col2']

# Count how many times each (col1, col2) pair occurs
count = df.groupby(cols).size().reset_index(name='COUNT')
# Pairs that occur more than once are exact duplicates: set them to 0
count.loc[count['COUNT'] > 1, 'COUNT'] = 0

df = df.merge(count, on=cols, how='left')

# Sum the flags per col1 value
count_col1 = df.groupby('col1')['COUNT'].sum().reset_index()
count_col1 = count_col1.rename(columns={'COUNT': 'COUNT_COL1'})
# Keep only the col1 values whose total is at most 1
count_col1 = count_col1[count_col1['COUNT_COL1'] <= 1]

# Rows whose col1 survived the filter have no differing partner: set them to 0
df = df.merge(count_col1, on='col1', how='left')
df.loc[df['COUNT_COL1'].notnull(), 'COUNT'] = 0
df = df.drop(columns='COUNT_COL1')

Upvotes: 1
