smallcat31

Reputation: 344

function across 2 dataframes based on index (python)

I have 2 dataframes, A and B, and am wondering how to create the dataframe shown in orange.

The value of each cell would be based on its row and column labels. For example, the top left cell would be a function of the row and column index (dataframe A.A0 + dataframe A.A1 - dataframe B.0).

I tried with an empty dataframe of the same dimensions as the orange one (emptyDf):

emptyDf.applyMap(lambda x: x[dfA[0]] + x[dfA[1] - x[dfB[0]]]

Upvotes: 1

Views: 74

Answers (1)

cardamom

Reputation: 7421

What you are trying to do is less a typical use of a Pandas dataframe and more a matrix manipulation exercise, for which NumPy, the library on which Pandas is built, is more appropriate. It is not hard to move between Pandas dataframes and NumPy arrays and back again, though you need to keep the index and column labels somewhere safe so you can reattach them when you bring the result back into Pandas. NumPy has functions for most manipulations you could want; here are a few that help with this problem:

import pandas as pd
import numpy as np

# create your dataframes:

series = pd.Series([10,9,8,7,6], index=[0,1,2,3,4])
df1 = pd.DataFrame([series])

cols = ['A','B','C','D']
list_of_series = [pd.Series([1,2,3,4],index=cols), pd.Series([5,6,7,8],index=cols)]
df2 = pd.DataFrame(list_of_series, columns=cols)

Now convert to NumPy arrays:

A = np.array(df2)
>>> A
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

B = np.array(df1)
>>> B.T
array([[10],
       [ 9],
       [ 8],
       [ 7],
       [ 6]])

Now a few NumPy operations to accomplish the task:

C = A.sum(axis=0)
D = np.tile(C,(5,1))
E = np.tile(B.T, (1,4))
F = D - E
F
array([[-4, -2,  0,  2],
       [-3, -1,  1,  3],
       [-2,  0,  2,  4],
       [-1,  1,  3,  5],
       [ 0,  2,  4,  6]])
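
As a side note, the two np.tile calls can probably be skipped, because NumPy broadcasting expands a (4,) array and a (5, 1) array to a common (5, 4) shape on its own. A minimal sketch, assuming the same values as A and B above (the name F_broadcast is only illustrative):

```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])      # same values as df2
B = np.array([[10, 9, 8, 7, 6]])  # same values as df1

# A.sum(axis=0) has shape (4,) and B.T has shape (5, 1);
# broadcasting stretches both to (5, 4) before subtracting.
F_broadcast = A.sum(axis=0) - B.T
print(F_broadcast)
```

This should give the same array as F without materialising the tiled copies D and E.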

Now convert it back to a dataframe:

pd.DataFrame(F, columns=['A','B','C','D'], index=[0,1,2,3,4])
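
Rather than typing the labels by hand, they can be taken from the original dataframes, which is one way of keeping them "somewhere safe". A rough sketch, assuming the row labels of the result should be the columns of df1 and the column labels those of df2 (the name result is only illustrative):

```python
import numpy as np
import pandas as pd

# Same example data as above.
df1 = pd.DataFrame([pd.Series([10, 9, 8, 7, 6], index=[0, 1, 2, 3, 4])])
df2 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=['A', 'B', 'C', 'D'])

# Column sums of df2 minus each value of df1, computed on plain arrays.
F = df2.to_numpy().sum(axis=0) - df1.to_numpy().T

# Reattach the labels: rows from df1's columns, columns from df2's columns.
result = pd.DataFrame(F, index=df1.columns, columns=df2.columns)
print(result)
```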

Anyway, I wonder whether this can be done directly in Pandas, but it strikes me as a matrix problem, and since the computation stays within NumPy I don't think it would be slow, even for a large system.
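
On the question of staying in Pandas, one possible approach is to build the result column by column from the column totals of df2 and the single row of df1. This is only a sketch under the assumption that the example data above is representative; the names row, col_sums and result_pd are illustrative, and I have not compared its speed with the NumPy version:

```python
import pandas as pd

# Same example data as above.
df1 = pd.DataFrame([[10, 9, 8, 7, 6]], columns=[0, 1, 2, 3, 4])
df2 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=['A', 'B', 'C', 'D'])

row = df1.iloc[0]     # Series indexed 0..4 (the single row of df1)
col_sums = df2.sum()  # Series indexed A..D (column totals of df2)

# One output column per column of df2: its total minus every value in df1's row.
result_pd = pd.DataFrame({name: total - row for name, total in col_sums.items()})
print(result_pd)
```

The labels come along automatically here, so nothing has to be stored and reattached.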

Upvotes: 1