Reputation: 5097
I have a dataframe (df) that looks like:
name type cost
a apples 1
b apples 2
c oranges 1
d banana 4
e orange 6
Apart from using 2 for loops is there a way to loop through and compare each name and type in the list against each other and where the name is not itself (A vs A), the type is the same (apples vs apples) and its not a repeat of the same pair but the other way around e.g. if we have A vs B, I would not want to see B vs A, produce an output list of that looks:
name1, name2, status
a b 0
c e 0
Where the first 2 elements are the names where the criteria match and the third element is always a 0.
I have tried to do this with 2 for loops (see below) but can't get it to reject say b vs a if we already have a vs b.
def pairListCreator(staticData):
for x, row1 in df.iterrows():
name1 = row1['name']
type1= row1['type']
for y, row2 in df.iterrows():
name2 = row['name']
type2 = row['type']
if name1<> name2 and type1 = type2:
pairList = name1,name2,0
Upvotes: 1
Views: 690
Reputation: 3504
Something like this
import pandas as pd
# Data
data = [['a', 'apples', 1],
['b', 'apples', 2],
['c', 'orange', 1],
['d', 'banana', 4],
['e', 'orange', 6]]
# Create Dataframe
df = pd.DataFrame(data, columns=['name', 'type', 'cost'])
df.set_index('name', inplace=True)
# Print DataFrame
print df
# Count number of rows
nr_of_rows = df.shape[0]
# Create result and compare
res_col_nam = ['name1', 'name2', 'status']
result = pd.DataFrame(columns=res_col_nam)
for i in range(nr_of_rows):
x = df.iloc[i]
for j in range(i + 1, nr_of_rows):
y = df.iloc[j]
if x['type'] == y['type']:
temp = pd.DataFrame([[x.name, y.name, 0]], columns=res_col_nam)
result = result.append(temp)
# Reset the index
result.reset_index(inplace=True)
result.drop('index', axis=1, inplace=True)
# Print result
print 'result:'
print result
Output:
type cost
name
a apples 1
b apples 2
c orange 1
d banana 4
e orange 6
result:
name1 name2 status
0 a b 0.0
1 c e 0.0
Upvotes: 2
Reputation: 862511
You can use self join on column type
first, then sort values in names
column per row by apply(sorted)
.
Then remove same values in names columns by boolean indexing
, drop_duplicates
and add new column status
by assign
:
df = pd.merge(df,df, on='type', suffixes=('1','2'))
names = ['name1','name2']
df[names] = df[names].apply(sorted, axis=1)
df = df[df.name1 != df.name2].drop_duplicates(subset=names)[names]
.assign(status=0)
.reset_index(drop=True)
print (df)
name1 name2 status
0 a b 0
1 c e 0
Upvotes: 1