Reputation: 1585
Consider I have dataframe:
data = [[11, 10, 13], [16, 15, 45], [35, 14,9]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C'])
df
The data looks like:
A B C
0 11 10 13
1 16 15 45
2 35 14 9
The real data consists of a hundred columns and thousand rows.
I have a function, the aim of the function is to count how many values that higher than the minimum value of another column. The function looks like this:
def get_count_higher_than_min(df, column_name_string, df_col_based):
seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
numOfRows = len(seriesObj[seriesObj == True].index)
return numOfRows
Example output from the function like this:
get_count_higher_than_min(df, 'A', df['B'])
The output is 3
. That is because the minimum value of df['B']
is 10
and three values from df['A']
are higher than 10
, so the output is 3
.
The problem is I want to compute the pairwise of all columns using that function
I don't know what an effective and efficient way to solve this issue. I want the output in the form of a similar to confusion matrix or similar to correlation matrix.
Example output:
A B C
A X 3 X
B X X X
C X X X
Upvotes: 2
Views: 539
Reputation: 9047
from itertools import product
pairs = product(df.columns, repeat=2)
min_value = {}
output = []
for each_pair in pairs:
# making sure that we are calculating min only once
min_ = min_value.get(each_pair[1], df[each_pair[1]].min())
min_value[each_pair[1]] = min_
count = df[df[each_pair[0]]>min_][each_pair[0]].count()
output.append(count)
df_desired = pd.DataFrame(
[output[i: i+len(df.columns)] for i in range(0, len(output), len(df.columns))],
columns=df.columns, index=df.columns)
print(df_desired)
A B C
A 2 3 3
B 2 2 3
C 2 2 2
Upvotes: 1
Reputation: 117771
This is O(n2m) where n is the number of columns and m the number of rows.
minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
for c in df.columns})
Result:
>>> m
A B C
A 2 3 3
B 2 2 3
C 2 2 2
In theory O(n log(n) m) is possible.
Upvotes: 1