Stef

Reputation: 30679

Largest elementwise difference between all rows in dataframe

Given the following dataframe:

      c1  c2  c3  c4
code
x      1   2   1   1
y      3   2   2   1
z      2   0   4   1

For each row in this dataframe I want to calculate the largest elementwise absolute difference between that row and every other row, and put the results into a new dataframe:

       x   y   z
code
x      0   2   3
y      2   0   2
z      3   2   0

(The result is, of course, a symmetric matrix with a zero main diagonal, so it would be sufficient to get just the upper or the lower triangular half.)

So for instance the maximum elementwise difference between rows x and y is 2 (from column c1: abs(3 - 1) = 2).

What I got so far:

import pandas as pd

df = pd.DataFrame(data={'code': ['x','y','z'], 'c1': [1, 3, 2], 'c2': [2, 2, 0], 'c3': [1,2,4], 'c4': [1,1,1]})
df.set_index('code', inplace = True)

df1 = pd.DataFrame()

for row in df.iterrows():
   df1.append((df-row[1]).abs().max(1), ignore_index = True)

When run interactively, this already looks close to what I need, but the new df1 is still empty afterwards:

>>> for row in df.iterrows(): df1.append((df-row[1]).abs().max(1),ignore_index=True)
...
     x    y    z
0  0.0  2.0  3.0
     x    y    z
0  2.0  0.0  2.0
     x    y    z
0  3.0  2.0  0.0
>>> df1
Empty DataFrame
Columns: []
Index: []

Questions:

  1. How do I get the results into the new dataframe df1 (with the correct index x, y, ...)?
  2. This is only an MCVE. In reality, df has about 700 rows, so I am not sure iterrows is a good choice. I have a feeling that the apply method would come in handy here, but I couldn't figure it out. Is there a more idiomatic, pandas-like way to do this without explicitly iterating over the rows?

Upvotes: 1

Views: 75

Answers (2)

heena bawa

Reputation: 828

DataFrame.append does not modify df1 in place; it returns a new dataframe, which your loop discards. To make your code produce the expected output, assign the returned value back to df1:

for row in df.iterrows():
    df1 = df1.append((df-row[1]).abs().max(1), ignore_index = True)

df1.index = df.index
print (df1)

     x    y    z
x  0.0  2.0  3.0
y  2.0  0.0  2.0
z  3.0  2.0  0.0
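
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older versions. A minimal sketch of the same row-by-row idea without append, assuming the df from the question (the name `rows` is only illustrative and not part of the original code):

```python
import pandas as pd

df = pd.DataFrame(
    data={'code': ['x', 'y', 'z'], 'c1': [1, 3, 2], 'c2': [2, 2, 0],
          'c3': [1, 2, 4], 'c4': [1, 1, 1]}
).set_index('code')

# For each row, compute the largest absolute difference to every row of df
# (the row compared with itself gives 0). `rows` is an illustrative name.
rows = {code: (df - row).abs().max(axis=1) for code, row in df.iterrows()}

# Build the result once instead of growing a dataframe inside the loop.
# The dict keys become columns, so transpose to get one result row per code.
df1 = pd.DataFrame(rows).T
print(df1)
```

Collecting the per-row results first and constructing the dataframe once also avoids copying df1 on every iteration, which is what repeated appending did.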

Upvotes: 0

jpp

Reputation: 164843

You can use NumPy and feed an array to the pd.DataFrame constructor. For a small number of rows, as in your data, this should be efficient.

import numpy as np

# A[:, None] adds an axis so the subtraction broadcasts over every pair of rows
A = df.values
res = pd.DataFrame(np.abs(A - A[:, None]).max(2),
                   index=df.index, columns=df.index.values)

print(res)

      x  y  z
code         
x     0  2  3
y     2  0  2
z     3  2  0
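
To make the broadcasting step easier to follow, here is a sketch that spells out the intermediate shapes and then keeps only the upper triangle, since the question notes that one half of the symmetric result is enough. It assumes the df from the question; the names `diff`, `pairs`, `code_a` and `code_b` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={'code': ['x', 'y', 'z'], 'c1': [1, 3, 2], 'c2': [2, 2, 0],
          'c3': [1, 2, 4], 'c4': [1, 1, 1]}
).set_index('code')

A = df.to_numpy()                      # shape (3, 4): rows x columns

# Broadcasting (1, 3, 4) against (3, 1, 4) yields every pairwise row
# difference: diff[i, j] is A[j] - A[i], with shape (3, 3, 4).
diff = A[None, :, :] - A[:, None, :]

# Reduce over the column axis to get one number per pair of rows.
res = pd.DataFrame(np.abs(diff).max(axis=2),
                   index=df.index, columns=df.index.to_numpy())

# Keep only the pairs above the main diagonal (k=1 skips the zeros).
i, j = np.triu_indices(len(df), k=1)
pairs = pd.Series(
    res.to_numpy()[i, j],
    index=pd.MultiIndex.from_arrays([df.index[i], df.index[j]],
                                    names=['code_a', 'code_b']),
)
print(pairs)
```

The intermediate array holds rows × rows × columns values, so memory use grows quadratically with the number of rows; whether that is acceptable for about 700 rows depends on how many columns the real data has.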

Upvotes: 1
