Reputation: 466
I have an indexed dataframe that contains many columns, some examples:
Feature1
Feature2
Feature3
Feature4
....
I simply want to implement a function that creates a new dataframe (or another data structure) by comparing the values of one test-sample row with the values of every other row (the test sample included). Where the values are equal the comparison result should be "1", otherwise "0". Because I have 91 columns, I don't want to refer to the columns by name; most of the examples I've seen pass column names to pandas functions.
Data example for the classified_data object (NaN means null):
_product Feature1 Feature2 Feature3 Feature4
SRI3012 1 yes IN NaN
SRI3015 1 yes IN NaN
SRS3012 1 no OUT Val1
I've simply tried :
##Choose sample
test_sample = classified_data.sample();
#Find index of random sample
test_product_code = list(test_sample.index.values)[0]
##Find location of random product in data-set
test_index = classified_data.index.get_loc(test_product_code)
#print(test_sample);
#print(classified_data[(test_index):(test_index+1)])
enum_similarity_data = pandas.DataFrame(calculate_similarity_for_categorical(classified_data[(test_index):(test_index+1)],classified_data).T,index=classified_data.index)
def calculate_similarity_for_categorical(value1,value2):
    if(value1 == value2):
        return 1;
    else:
        return 0;
Desired output for SRI3012 (assuming it is the randomly chosen sample): a dataframe or another object with the column names and these values:
_product Feature1 Feature2 Feature3 Feature4
SRI3012 1 1 1 1
SRI3015 1 1 1 1
SRS3012 1 0 0 0
Upvotes: 0
Views: 1121
Reputation: 514
I cannot comment, so I'll comment here. As Quang Hoang commented, you should not use screenshots but simple, nicely formatted data that anyone who spends their precious time helping you can copy. Also, all this complex information is not necessary. You can reproduce the essence of your question with a simple dummy DataFrame with simple values and names. This way you will get better and faster answers.
Try this:
import numpy as np
import pandas as pd
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'Feature1': [1, 1, 1],
                   'Feature2': ['yes', 'yes', 'no'],
                   'Feature3': ['IN', 'IN', 'OUT'],
                   'Feature4': [np.nan, np.nan, 5]
                   },
                  index=['SR12', 'SR13', 'SR14']
                  )
df.index.name = '_product'
def compare_against_series(x, reference):
    """compares a Series against a reference Series"""
    # two values match if they are equal or if both are null
    # apply .astype(int) to convert boolean to 0-1
    return np.logical_or(x == reference, x.isnull() & reference.isnull()).astype(int)
# take the 1st row as sample
sample = df.iloc[0]
# apply compare_against_series row-wise, using the sample
# note axis=1 means row-wise and axis=0 column-wise
result = df.apply(compare_against_series, axis=1, reference=sample)
df:
Feature1 Feature2 Feature3 Feature4
_product
SR12 1 yes IN NaN
SR13 1 yes IN NaN
SR14 1 no OUT 5.0
sample:
Feature1 1
Feature2 yes
Feature3 IN
Feature4 NaN
Name: SR12, dtype: object
result:
Feature1 Feature2 Feature3 Feature4
_product
SR12 1 1 1 1
SR13 1 1 1 1
SR14 1 0 0 0
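The question asks for a randomly chosen test sample rather than the first row. A minimal sketch of that variation, assuming the same dummy data as above; the fixed random_state is only there to make the run reproducible, and the names sample_row and sample_label are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Feature1": [1, 1, 1],
        "Feature2": ["yes", "yes", "no"],
        "Feature3": ["IN", "IN", "OUT"],
        "Feature4": [np.nan, np.nan, 5],
    },
    index=pd.Index(["SR12", "SR13", "SR14"], name="_product"),
)


def compare_against_series(x, reference):
    """Return 1 where x matches reference (or both are null), else 0."""
    return ((x == reference) | (x.isnull() & reference.isnull())).astype(int)


# Draw one random row; drop random_state for a different row on every run.
sample_row = df.sample(n=1, random_state=0)
sample_label = sample_row.index[0]  # index label of the chosen row
sample = sample_row.iloc[0]         # the chosen row as a Series

# No column names are needed: every row is compared position by position.
result = df.apply(compare_against_series, axis=1, reference=sample)
print(sample_label)
print(result)
```

Whichever row is drawn, its own line in result is all ones, since a row always matches itself once null values are treated as equal.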
Upvotes: 0
Reputation: 59579
DataFrame.eq
You can check the equality of one row against all other rows by specifying axis=1. The comparison here should be DataFrame.eq(Series, axis=1).
If you consider NaN == NaN to be True (which is not the standard), we need to deal with that separately.
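To see why the null case needs separate handling, here is a small sketch on a made-up two-column frame (the column names a and b are only for illustration). It compares every row against a reference row that contains NaN, first with plain eq and then with the extra null check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})
s = df.iloc[1]  # reference row: a is NaN, b is "y"

# Plain equality: NaN never equals NaN, so the reference row
# does not even fully match itself.
plain = df.eq(s, axis=1)

# Treat two nulls in the same column as a match.
nan_aware = plain | (df.isnull() & s.isnull())

print(plain)
print(nan_aware)
```

With plain equality the reference row reports a mismatch in column a against itself; with the null check added, the same row matches in every column.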
import pandas as pd
import numpy as np
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame([['A', 'A', 'B', 'C', np.nan], ['A', 'A', 'B', 'C', np.nan],
                   ['A', 'X', 'Z', 'C', np.nan], [6, 'foo', 'bar', 12, 1231.1]])
# 0 1 2 3 4
#0 A A B C NaN
#1 A A B C NaN
#2 A X Z C NaN
#3 6 foo bar 12 1231.1
s = df.iloc[0] # or df.iloc[np.random.choice(range(df.shape[0]))]
(df.eq(s, axis=1) | (s.isnull() & df.isnull())).astype(int)
# so NaN == NaN is True
# 0 1 2 3 4
#0 1 1 1 1 1
#1 1 1 1 1 1
#2 1 0 0 1 1
#3 0 0 0 0 0
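The same expression can be wrapped in a small helper and run on the data from the question. This is only a sketch: the function name compare_to_row and its nan_equal flag are made up for illustration, and it assumes the index labels are unique so that df.loc[label] returns a single row.

```python
import numpy as np
import pandas as pd


def compare_to_row(df, label, nan_equal=True):
    """Return a 0/1 frame marking where each row matches the row at `label`."""
    reference = df.loc[label]  # assumes `label` occurs once in the index
    matches = df.eq(reference, axis=1)
    if nan_equal:
        # Count two nulls in the same column as a match.
        matches = matches | (df.isnull() & reference.isnull())
    return matches.astype(int)


classified_data = pd.DataFrame(
    {
        "Feature1": [1, 1, 1],
        "Feature2": ["yes", "yes", "no"],
        "Feature3": ["IN", "IN", "OUT"],
        "Feature4": [np.nan, np.nan, "Val1"],
    },
    index=pd.Index(["SRI3012", "SRI3015", "SRS3012"], name="_product"),
)

similarity = compare_to_row(classified_data, "SRI3012")
print(similarity)
```

For a random test sample, the label can be drawn with classified_data.sample().index[0] and passed in the same way. None of the 91 column names has to be written out, because eq aligns the reference row with the columns on its own.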
Upvotes: 1