Pandas dataframe compare all column values of index without column-name reference

Question

I have an indexed dataframe that contains many columns, for example:

    Feature1
    Feature2
    Feature3
    Feature4
....

I want to implement a function that creates a new dataframe (or another data structure) by comparing the values of one test-sample row with the values of every row, the test sample included. Where the values are equal, the comparison result should be "1", otherwise "0". Because I have 91 columns, I don't want to reference column names explicitly; many of the examples I've seen pass column names to pandas functions.

Data example for the classified_data object (NaN means null):

_product Feature1 Feature2 Feature3 Feature4
SRI3012  1        yes         IN    NaN
SRI3015  1        yes         IN    NaN
SRS3012  1        no          OUT   Val1

I've tried:

    # Choose a random sample
    test_sample = classified_data.sample()
    # Find the index label of the random sample
    test_product_code = list(test_sample.index.values)[0]
    # Find the position of the random product in the data set
    test_index = classified_data.index.get_loc(test_product_code)
    # print(test_sample)
    # print(classified_data[test_index:test_index + 1])
    enum_similarity_data = pandas.DataFrame(
        calculate_similarity_for_categorical(
            classified_data[test_index:test_index + 1], classified_data
        ).T,
        index=classified_data.index,
    )


    def calculate_similarity_for_categorical(value1, value2):
        if value1 == value2:
            return 1
        else:
            return 0

Desired output for SRI3012 (assuming it is the randomly chosen sample), as a dataframe or another object that keeps the column names and values:

_product Feature1 Feature2 Feature3 Feature4
SRI3012  1        1        1        1
SRI3015  1        1        1        1
SRS3012  1        0        0        0

ALollz · Accepted Answer

`DataFrame.eq`

You can check the equality of one row against all other rows by specifying axis=1. The comparison should be DataFrame.eq(Series, axis=1). If you want NaN == NaN to count as True (which is not the standard behavior), that case needs to be handled separately.
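As a quick illustration of why the NaN case needs its own handling, here is a minimal sketch (the names `frame` and `row` are made up for this example). Under standard floating-point rules NaN never compares equal to anything, itself included, so a plain `eq` reports False wherever both sides are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column data: column 1 is missing everywhere.
row = pd.Series(["A", np.nan])
frame = pd.DataFrame([["A", np.nan], ["B", np.nan]])

# Plain equality: NaN is never equal to NaN.
plain = frame.eq(row, axis=1)
#        0      1
# 0   True  False
# 1  False  False

# Cells where both the frame and the reference row are missing.
both_missing = frame.isnull() & row.isnull()

# Treat "both missing" as a match.
combined = plain | both_missing
#        0     1
# 0   True  True
# 1  False  True
```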

import pandas as pd
import numpy as np
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame([['A', 'A', 'B', 'C', np.nan], ['A', 'A', 'B', 'C', np.nan],
                   ['A', 'X', 'Z', 'C', np.nan], [6, 'foo', 'bar', 12, 1231.1]])
#   0    1    2   3       4
#0  A    A    B   C     NaN
#1  A    A    B   C     NaN
#2  A    X    Z   C     NaN
#3  6  foo  bar  12  1231.1

s = df.iloc[0]  # or df.iloc[np.random.choice(range(df.shape[0]))]
(df.eq(s, axis=1) | (s.isnull() & df.isnull())).astype(int)
                     # so NaN == NaN is True

#   0  1  2  3  4
#0  1  1  1  1  1
#1  1  1  1  1  1
#2  1  0  0  1  1
#3  0  0  0  0  0
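
Applied to the data from the question, the same idea might look like the sketch below. The helper name `compare_to_row` is hypothetical, and the frame is rebuilt by hand from the three sample rows, assuming the product code is a unique index (with duplicate labels, `df.loc[label]` would return a dataframe rather than a single row).

```python
import numpy as np
import pandas as pd

# Sample rows from the question, rebuilt by hand for illustration.
classified_data = pd.DataFrame(
    {
        "Feature1": [1, 1, 1],
        "Feature2": ["yes", "yes", "no"],
        "Feature3": ["IN", "IN", "OUT"],
        "Feature4": [np.nan, np.nan, "Val1"],
    },
    index=pd.Index(["SRI3012", "SRI3015", "SRS3012"], name="_product"),
)


def compare_to_row(df, label):
    """Return a 0/1 frame marking where each row matches the row at `label`.

    No column names are referenced: the reference row is aligned
    against the columns by DataFrame.eq(..., axis=1).
    """
    row = df.loc[label]
    # Equal values, or missing on both sides (NaN treated as equal to NaN).
    same = df.eq(row, axis=1) | (df.isnull() & row.isnull())
    return same.astype(int)


result = compare_to_row(classified_data, "SRI3012")
# _product  Feature1  Feature2  Feature3  Feature4
# SRI3012          1         1         1         1
# SRI3015          1         1         1         1
# SRS3012          1         0         0         0
```

To pick the reference row at random, as in the question, `classified_data.sample().index[0]` gives a label that can be passed as `label`.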
