Hayra

Reputation: 466

Pandas dataframe compare all column values of index without column-name reference

I have an indexed dataframe that contains many columns, some examples:

    Feature1
    Feature2
    Feature3
    Feature4
....

I simply want to implement a function that creates a new dataframe (or another data structure) by comparing the values of one test-sample row with the values of every row, the test sample included. Where the values are equal the comparison result should be "1", otherwise "0". Since I have 91 columns, I don't want to refer to the columns by name, as most of the pandas examples I've seen do.

Data example for the classified_data object (NaN means null):

_product Feature1 Feature2 Feature3 Feature4
SRI3012  1        yes         IN    NaN
SRI3015  1        yes         IN    NaN
SRS3012  1        no          OUT   Val1

I've simply tried:

# Choose a random sample row
test_sample = classified_data.sample()
# Find the index label of the random sample
test_product_code = list(test_sample.index.values)[0]
# Find the location of the random product in the data set
test_index = classified_data.index.get_loc(test_product_code)
# print(test_sample)
# print(classified_data[test_index:test_index + 1])
enum_similarity_data = pandas.DataFrame(
    calculate_similarity_for_categorical(
        classified_data[test_index:test_index + 1], classified_data
    ).T,
    index=classified_data.index,
)


def calculate_similarity_for_categorical(value1, value2):
    if value1 == value2:
        return 1
    else:
        return 0

Desired output for SRI3012 (assuming it is the randomly chosen sample): a dataframe or another object with the column names and these values:

_product Feature1 Feature2 Feature3 Feature4
SRI3012  1        1        1        1
SRI3015  1        1        1        1
SRS3012  1        0        0        0

Upvotes: 0

Views: 1121

Answers (2)

Thanasis Mattas

Reputation: 514

I cannot comment, so I'll comment here. As Quang Hoang commented, you should not use screenshots but simple, nicely formatted data that anyone spending their precious time to help you can copy. Also, all this complex information is not necessary. You can reproduce the idea of your question with a simple dummy DataFrame with simple values and names. This way you will get better and faster answers.

Try this:

import numpy as np
import pandas as pd


df = pd.DataFrame({'Feature1':[    1 ,     1 ,    1 ],
                   'Feature2':[ 'yes',  'yes',  'no'], 
                   'Feature3':[ 'IN' ,  'IN' , 'OUT'],
                   'Feature4':[np.nan, np.nan,    5 ]
                  },
                  index=['SR12', 'SR13', 'SR14']
)
df.index.name = '_product'

def compare_against_series(x, reference):
    """Compare a Series against a reference Series, counting NaN vs NaN as a match."""
    # use the reference argument rather than a global variable
    # apply .astype(int) to convert boolean to 0-1
    return np.logical_or(x == reference, x.isnull() & reference.isnull()).astype(int)

# take the 1st row as sample 
sample = df.iloc[0]

# apply compare_against_series row-wise, using the sample
# note axis=1 means row-wise and axis=0 column-wise
result = df.apply(compare_against_series, axis=1, reference=sample)

df:

          Feature1 Feature2 Feature3 Feature4
_product                            
SR12             1      yes       IN      NaN
SR13             1      yes       IN      NaN
SR14             1       no      OUT      5.0

sample:

Feature1      1
Feature2    yes
Feature3     IN
Feature4    NaN
Name: SR12, dtype: object

result:

          Feature1  Feature2  Feature3  Feature4
_product                              
SR12             1         1         1         1
SR13             1         1         1         1
SR14             1         0         0         0
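
The `isnull()` term in the comparison matters because NaN never compares equal to itself under `==`. As a minimal sketch, rebuilding the same dummy data with `np.nan`, the two variants can be put side by side. The names `plain` and `both_nan` are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Feature1": [1, 1, 1],
        "Feature2": ["yes", "yes", "no"],
        "Feature3": ["IN", "IN", "OUT"],
        "Feature4": [np.nan, np.nan, 5],
    },
    index=["SR12", "SR13", "SR14"],
)
sample = df.iloc[0]

# Plain equality: NaN == NaN is False, so Feature4 never matches
plain = df.apply(lambda row: (row == sample).astype(int), axis=1)

# Equality that also counts "missing in both rows" as a match
both_nan = df.apply(
    lambda row: ((row == sample) | (row.isnull() & sample.isnull())).astype(int),
    axis=1,
)

print(plain["Feature4"].tolist())
print(both_nan["Feature4"].tolist())
```

With plain equality the sample row does not even match itself in `Feature4`, which is why the extra null check is needed to get the output the question asks for.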

Upvotes: 0

ALollz

Reputation: 59579

DataFrame.eq

You can check the equality of one row with all other rows by specifying axis=1. The comparison here should be DataFrame.eq(Series, axis=1). If you consider NaN == NaN to be True (which is not the standard), you need to deal with that separately.

import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 'A', 'B', 'C', np.nan], ['A', 'A', 'B', 'C', np.nan],
                   ['A', 'X', 'Z', 'C', np.nan], [6, 'foo', 'bar', 12, 1231.1]])
#   0    1    2   3       4
#0  A    A    B   C     NaN
#1  A    A    B   C     NaN
#2  A    X    Z   C     NaN
#3  6  foo  bar  12  1231.1

s = df.iloc[0]  # or df.iloc[np.random.choice(range(df.shape[0]))]
(df.eq(s, axis=1) | (s.isnull() & df.isnull())).astype(int)
                     # so NaN == NaN is True

#   0  1  2  3  4
#0  1  1  1  1  1
#1  1  1  1  1  1
#2  1  0  0  1  1
#3  0  0  0  0  0
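
Applied to the data from the question, the same idea might look like the sketch below. `compare_to_row` is a hypothetical helper name, not a pandas function, and the sketch assumes the index labels are unique so that `df.loc[label]` returns a single row as a Series:

```python
import numpy as np
import pandas as pd


def compare_to_row(df, label):
    """Return a 0/1 frame marking where each row matches the row at `label`.

    Hypothetical helper; two missing values are counted as a match.
    """
    ref = df.loc[label]  # assumes unique index labels
    matches = df.eq(ref, axis=1) | (df.isnull() & ref.isnull())
    return matches.astype(int)


classified_data = pd.DataFrame(
    {
        "Feature1": [1, 1, 1],
        "Feature2": ["yes", "yes", "no"],
        "Feature3": ["IN", "IN", "OUT"],
        "Feature4": [np.nan, np.nan, "Val1"],
    },
    index=pd.Index(["SRI3012", "SRI3015", "SRS3012"], name="_product"),
)

# Random reference row: sample() returns a one-row DataFrame
test_product_code = classified_data.sample().index[0]
random_result = compare_to_row(classified_data, test_product_code)

# Fixed reference row, matching the example in the question
result = compare_to_row(classified_data, "SRI3012")
print(result)
```

No column names appear in the comparison itself, so the helper should work unchanged for 91 columns.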

Upvotes: 1
