Stacey
Stacey

Reputation: 5097

Match values in dataframe rows

I have a dataframe (df) that looks like:

name  type    cost
a     apples  1
b     apples  2
c     oranges 1 
d     banana  4
e     orange  6

Apart from using 2 for loops is there a way to loop through and compare each name and type in the list against each other and where the name is not itself (A vs A), the type is the same (apples vs apples) and its not a repeat of the same pair but the other way around e.g. if we have A vs B, I would not want to see B vs A, produce an output list of that looks:

name1, name2, status
    a      b       0
    c      e       0

Where the first 2 elements are the names where the criteria match and the third element is always a 0.

I have tried to do this with 2 for loops (see below) but can't get it to reject say b vs a if we already have a vs b.

def pairListCreator(staticData):
    for x, row1 in df.iterrows():

        name1 = row1['name']
        type1= row1['type']

        for y, row2 in df.iterrows():
            name2 = row['name']
            type2 = row['type']

            if name1<> name2 and  type1 = type2:
                pairList = name1,name2,0

Upvotes: 1

Views: 690

Answers (2)

Elmex80s
Elmex80s

Reputation: 3504

Something like this

import pandas as pd

# Data
data = [['a', 'apples', 1],
        ['b', 'apples', 2],
        ['c', 'orange', 1],
        ['d', 'banana', 4],
        ['e', 'orange', 6]]

# Create Dataframe
df = pd.DataFrame(data, columns=['name', 'type', 'cost'])
df.set_index('name', inplace=True)

# Print DataFrame
print df

# Count number of rows
nr_of_rows = df.shape[0]

# Create result and compare
res_col_nam = ['name1', 'name2', 'status']
result = pd.DataFrame(columns=res_col_nam)

for i in range(nr_of_rows):
    x = df.iloc[i]

    for j in range(i + 1, nr_of_rows):
        y = df.iloc[j]

        if x['type'] == y['type']:
            temp = pd.DataFrame([[x.name, y.name, 0]], columns=res_col_nam)
            result = result.append(temp)

# Reset the index
result.reset_index(inplace=True)
result.drop('index', axis=1, inplace=True)

# Print result
print 'result:'
print result

Output:

        type  cost
name              
a     apples     1
b     apples     2
c     orange     1
d     banana     4
e     orange     6
result:
  name1 name2  status
0     a     b     0.0
1     c     e     0.0

Upvotes: 2

jezrael
jezrael

Reputation: 862511

You can use self join on column type first, then sort values in names column per row by apply(sorted).

Then remove same values in names columns by boolean indexing, drop_duplicates and add new column status by assign:

df = pd.merge(df,df, on='type', suffixes=('1','2'))
names = ['name1','name2']
df[names] = df[names].apply(sorted, axis=1)
df = df[df.name1 != df.name2].drop_duplicates(subset=names)[names]
                             .assign(status=0)
                             .reset_index(drop=True)
print (df)
  name1 name2  status
0     a     b       0
1     c     e       0

Upvotes: 1

Related Questions