evolk

Reputation: 125

Pandas - convert dataframe to square matrix with pairwise combinations from index

I am converting a data frame to a square matrix. The data frame has an index and only one column with floats. What I need to do is to generate all pairs of indices and, for each pair, take the mean of the two associated column values. So the usual pivot function is only part of the solution.

Currently, the function has an estimated complexity of O(n^2), which is not good because I have to work with larger inputs: data frames with several hundred rows at a time. Is there a faster approach I could take?

Example input (with integers here for simplicity):

df = pd.DataFrame([3, 4, 5])

Update: transformation logic

For an input data frame in the example:

   0

0  3
1  4
2  5

I do the following (not claiming it is the best way though):

The code is in turn_table_into_square_matrix() below.

Desired output:

    0   1   2

0   0.0 3.5 4.0
1   3.5 0.0 4.5
2   4.0 4.5 0.0

Current implementation:

import pandas as pd
from itertools import combinations 
import time
import string
import random


def turn_table_into_square_matrix(original_dataframe):

    # get all pairs of indices 
    index_pairs = list(combinations(list(original_dataframe.index),2))

    rows_for_final_dataframe = []

    # collect new data frame row by row - the time consuming part
    for pair in index_pairs:
        subset_original_dataframe = original_dataframe[original_dataframe.index.isin(list(pair))]
        rows_for_final_dataframe.append([pair[0], pair[1], subset_original_dataframe[0].mean()])
        rows_for_final_dataframe.append([pair[1], pair[0], subset_original_dataframe[0].mean()])

    final_dataframe = pd.DataFrame(rows_for_final_dataframe)

    final_dataframe.columns = ["from", "to", "weight"]
    final_dataframe_pivot = final_dataframe.pivot(index="from", columns="to", values="weight")
    final_dataframe_pivot = final_dataframe_pivot.fillna(0)

    return final_dataframe_pivot

Code to time the performance:

for size in range(50, 600, 100):

    index = range(size)
    values = random.sample(range(0, 1000), size)
    example = pd.DataFrame(values, index)

    print ("dataframe size", example.shape)

    start_time = time.time()
    turn_table_into_square_matrix(example)
    print ("conversion time:", time.time()-start_time)

The timing results:

dataframe size (50, 1)
conversion time: 0.5455281734466553

dataframe size (150, 1)
conversion time: 5.001590013504028

dataframe size (250, 1)
conversion time: 14.562285900115967

dataframe size (350, 1)
conversion time: 31.168692111968994

dataframe size (450, 1)
conversion time: 49.07127499580383

dataframe size (550, 1)
conversion time: 78.73740792274475

Thus, a data frame with 50 rows takes only half a second to convert, whereas one with 550 rows (11 times as many) takes 79 seconds (more than 11^2 times longer). Is there a faster solution to this problem?

Upvotes: 5

Views: 5629

Answers (3)

How about this:

df.pivot(index='i', columns = 'j', values = 'empty')

For this you need to cheat the standard pivot a little: add the index as two new columns (since pivot does not accept the same column for both arguments) and add an empty column for the values:

df['i']=df.index
df['j']=df.index
df['empty']=None

And that's it.
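
Put in execution order on the question's three-row example, the whole trick reads as follows (a minimal sketch; i, j and empty are just the placeholder column names used above):

import pandas as pd

df = pd.DataFrame([3, 4, 5])
df['i'] = df.index        # duplicate the index as a column for pivot's index argument
df['j'] = df.index        # and again for pivot's columns argument
df['empty'] = None        # placeholder values column
square = df.pivot(index='i', columns='j', values='empty')
print(square.shape)       # (3, 3) -- a square frame keyed by the original index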

Upvotes: 0

choucavalier

Reputation: 2770

I don't think it is possible to do better than O(n^2) for that computation. As @piiipmatz suggested, you should try doing everything with numpy and then put the result in a pd.DataFrame. Your problem sounds like a good use case for numpy.add.at.

Here is a quick example

import numpy as np
import itertools

# your original array
x = np.array([1, 4, 8, 99, 77, 23, 4, 45])
n = len(x)
# all pairs of indices in x
a, b = zip(*list(itertools.product(range(n), range(n))))
a, b = np.array(a), np.array(b)
# resulting matrix
result = np.zeros(shape=(n, n))

# every (i, j) pair occurs exactly once and result starts at zero, so this
# unbuffered add simply writes the pairwise means into the matrix
np.add.at(result, [a, b], (x[a] + x[b]) / 2.0)

print(result)
# [[  1.    2.5   4.5  50.   39.   12.    2.5  23. ]
#  [  2.5   4.    6.   51.5  40.5  13.5   4.   24.5]
#  [  4.5   6.    8.   53.5  42.5  15.5   6.   26.5]
#  [ 50.   51.5  53.5  99.   88.   61.   51.5  72. ]
#  [ 39.   40.5  42.5  88.   77.   50.   40.5  61. ]
#  [ 12.   13.5  15.5  61.   50.   23.   13.5  34. ]
#  [  2.5   4.    6.   51.5  40.5  13.5   4.   24.5]
#  [ 23.   24.5  26.5  72.   61.   34.   24.5  45. ]]
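
If the zero diagonal from the question's desired output is also wanted, it can be set afterwards (a small addition, not part of the answer above):

# zero the diagonal so that result[i, i] == 0, matching the question's desired output
np.fill_diagonal(result, 0)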

Upvotes: 4

piiipmatz

Reputation: 400

I think you have a lot of overhead from pandas (for example, original_dataframe[original_dataframe.index.isin(list(pair))] seems too expensive for what it actually does). I haven't tested it, but I assume you can save a considerable amount of execution time if you just work with numpy arrays. If needed, you can still feed the result into a pandas.DataFrame at the end.

Something like (just to sketch what I mean):

import numpy as np
from itertools import combinations

original_array = original_dataframe.as_matrix().ravel()  # the single column as a flat array
n = len(original_array)
final_matrix = np.zeros((n, n))

# same index pairs as in the question, filling both halves of the symmetric matrix
for i, j in combinations(range(n), 2):
    mean = 0.5 * (original_array[i] + original_array[j])
    final_matrix[i, j] = mean
    final_matrix[j, i] = mean
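
And if a data frame is wanted at the end, one way to wrap the result (a sketch, reusing the original index on both axes):

import pandas as pd

# feed the numpy result back into a DataFrame, keyed by the original index
final_dataframe = pd.DataFrame(final_matrix,
                               index=original_dataframe.index,
                               columns=original_dataframe.index)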

Upvotes: 1
