Reputation: 125
I am converting a data frame to a square matrix. The data frame has an index and only one column with floats. What I need to do is to compute all pairs of indices and, for each pair, take the mean of the two associated column values. So the usual pivot function is only part of the solution.
Currently, the function has an estimated complexity of O(n^2), which is not good because I have to work with larger inputs: data frames with several hundred rows at a time. Is there a faster approach I could take?
Example input (with integers here for simplicity):
df = pd.DataFrame([3, 4, 5])
Update: transformation logic
For an input data frame in the example:
   0
0  3
1  4
2  5
I do the following (not claiming it is the best way, though): for each pair of indices (i, j) the output cell holds the mean of the two values, e.g. cell (0, 1) = (3 + 4) / 2 = 3.5, and the diagonal is filled with zeros. The code is in turn_table_into_square_matrix() below.
Desired output:
     0    1    2
0  0.0  3.5  4.0
1  3.5  0.0  4.5
2  4.0  4.5  0.0
Current implementation:
import pandas as pd
from itertools import combinations
import time
import string
import random

def turn_table_into_square_matrix(original_dataframe):
    # get all pairs of indices
    index_pairs = list(combinations(list(original_dataframe.index), 2))

    rows_for_final_dataframe = []

    # collect the new data frame row by row - the time-consuming part
    for pair in index_pairs:
        subset_original_dataframe = original_dataframe[original_dataframe.index.isin(list(pair))]
        rows_for_final_dataframe.append([pair[0], pair[1], subset_original_dataframe[0].mean()])
        rows_for_final_dataframe.append([pair[1], pair[0], subset_original_dataframe[0].mean()])

    final_dataframe = pd.DataFrame(rows_for_final_dataframe)
    final_dataframe.columns = ["from", "to", "weight"]
    final_dataframe_pivot = final_dataframe.pivot(index="from", columns="to", values="weight")
    final_dataframe_pivot = final_dataframe_pivot.fillna(0)
    return final_dataframe_pivot
Code to time the performance:
for size in range(50, 600, 100):
    index = range(size)
    values = random.sample(range(0, 1000), size)
    example = pd.DataFrame(values, index)
    print("dataframe size", example.shape)
    start_time = time.time()
    turn_table_into_square_matrix(example)
    print("conversion time:", time.time() - start_time)
The timing results:
dataframe size (50, 1)
conversion time: 0.5455281734466553
dataframe size (150, 1)
conversion time: 5.001590013504028
dataframe size (250, 1)
conversion time: 14.562285900115967
dataframe size (350, 1)
conversion time: 31.168692111968994
dataframe size (450, 1)
conversion time: 49.07127499580383
dataframe size (550, 1)
conversion time: 78.73740792274475
Thus, a data frame with 50 rows takes only half a second to convert, whereas one with 550 rows (11 times as many) takes 79 seconds (more than 11^2 times as long). Is there a faster solution to this problem?
Upvotes: 5
Views: 5629
Reputation: 1220
How about this:
df.pivot(index='i', columns='j', values='empty')
For this you need to cheat the standard pivot a bit: add the index as two new columns (since pivot does not allow the same argument to be used twice) and add an empty column for the values:
df['i'] = df.index
df['j'] = df.index
df['empty'] = None
And that's it.
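Putting the pieces together, a minimal sketch using the question's example frame (note that this only builds the empty n x n skeleton; the pairwise means would still have to be filled in separately):
import pandas as pd

df = pd.DataFrame([3, 4, 5])
df['i'] = df.index        # duplicate the index as a column for pivot's index argument
df['j'] = df.index        # ...and again for the columns argument
df['empty'] = None        # placeholder values
square = df.pivot(index='i', columns='j', values='empty')
print(square.shape)       # (3, 3): a square frame, all cells still empty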
Upvotes: 0
Reputation: 2770
I don't think it is possible to do better than O(n^2) for that computation. As @piiipmatz suggested, you should try doing everything with numpy and then put the result in a pd.DataFrame. Your problem sounds like a good use case for numpy.add.at.
Here is a quick example
import numpy as np
import itertools
# your original array
x = np.array([1, 4, 8, 99, 77, 23, 4, 45])
n = len(x)
# all pairs of indices in x
a, b = zip(*list(itertools.product(range(n), range(n))))
a, b = np.array(a), np.array(b)
# resulting matrix
result = np.zeros(shape=(n, n))
np.add.at(result, [a, b], (x[a] + x[b]) / 2.0)
print(result)
# [[ 1. 2.5 4.5 50. 39. 12. 2.5 23. ]
# [ 2.5 4. 6. 51.5 40.5 13.5 4. 24.5]
# [ 4.5 6. 8. 53.5 42.5 15.5 6. 26.5]
# [ 50. 51.5 53.5 99. 88. 61. 51.5 72. ]
# [ 39. 40.5 42.5 88. 77. 50. 40.5 61. ]
# [ 12. 13.5 15.5 61. 50. 23. 13.5 34. ]
# [ 2.5 4. 6. 51.5 40.5 13.5 4. 24.5]
# [ 23. 24.5 26.5 72. 61. 34. 24.5 45. ]]
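To match the desired output in the question (zeros on the diagonal, wrapped in a data frame), the example above can be finished off as follows. This is a sketch that continues from the result and n defined above, with plain integer positions assumed as the row and column labels:
import pandas as pd

np.fill_diagonal(result, 0.0)   # the question wants zeros on the diagonal
final = pd.DataFrame(result, index=range(n), columns=range(n))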
Upvotes: 4
Reputation: 400
I think you have a lot of overhead from pandas (e.g. original_dataframe[original_dataframe.index.isin(list(pair))] seems too expensive for what it actually does). I haven't tested it, but I assume you can save a considerable amount of execution time if you just work with numpy arrays. If needed, you can still feed the result to a pandas.DataFrame at the end.
Something like (just to sketch what I mean):
import numpy as np
from itertools import combinations

original_array = original_dataframe.as_matrix().ravel()  # .values / .to_numpy() in newer pandas
n = len(original_array)
final_matrix = np.zeros((n, n))
# iterate over all index pairs, as in the question, and fill both halves of the matrix
for pair in combinations(range(n), 2):
    mean_value = 0.5 * (original_array[pair[0]] + original_array[pair[1]])
    final_matrix[pair[0], pair[1]] = mean_value
    final_matrix[pair[1], pair[0]] = mean_value
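If even the pair loop turns out to be too slow, the same pairwise means can be computed in a single vectorized step with numpy broadcasting. A sketch, not benchmarked here; it assumes a recent pandas with .to_numpy() and reuses the question's original_dataframe:
import numpy as np
import pandas as pd

x = original_dataframe[0].to_numpy(dtype=float)   # the single column as a flat array
pairwise_means = 0.5 * (x[:, None] + x[None, :])  # (n, n) matrix of (x[i] + x[j]) / 2
np.fill_diagonal(pairwise_means, 0.0)             # zero diagonal, as in the desired output
result_df = pd.DataFrame(pairwise_means, index=original_dataframe.index, columns=original_dataframe.index)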
Upvotes: 1