Reputation: 391
I have a CSV file as such
ID OLD_A NEW_A OLD_B NEW_B OLD_C NEW_C
1 0 0 1/1/2017 1/1/2017 ABC BCD
2 0 0 1/1/2017 2/1/2017 ABC ABC
3 1 2 1/1/2017 1/1/2017 ABC BCD
I want to compare the old and new columns for A,B and C and in case there's a difference in the value of OLD and NEW (there's 10k+ rows) , I want return an output as such (from above example):
ID Field_Changed OLD_Value NEW_Value
1 C ABC BCD
2 B 1/1/2017 2/1/2017
3 A 1 2
3 C ABC BCD
So far I've used the .loc method of pandas.DataFrame which returns the indexed location of the rows that qualify the boolean indexing, but I need the values as well...
df.loc[(df['OLD_A'] != df['NEW_A'])].index)
I'm new to python scripting and cannot seem to figure the logic out. Can someone help please?
Upvotes: 1
Views: 877
Reputation: 862511
set_index
and create MultiIndex
by split
column names
stack
and rename column
filter with boolean indexing
df = df.set_index('ID')
df.columns = df.columns.str.split('_', expand=True)
df1 = df.stack()[['OLD','NEW']].reset_index().rename(columns={'level_1':'Field_Changed'})
print (df1)
ID Field_Changed OLD NEW
0 1 A 0 0
1 1 B 1/1/2017 1/1/2017
2 1 C ABC BCD
3 2 A 0 0
4 2 B 1/1/2017 2/1/2017
5 2 C ABC ABC
6 3 A 1 2
7 3 B 1/1/2017 1/1/2017
8 3 C ABC BCD
print (df1.columns)
Index(['ID', 'Field_Changed', 'OLD', 'NEW'], dtype='object')
print (df1.index)
RangeIndex(start=0, stop=9, step=1)
print (df1['OLD'] != df1['NEW'])
0 False
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 True
dtype: bool
df2 = df1[df1['OLD'] != df1['NEW']]
print (df2)
ID Field_Changed OLD NEW
2 1 C ABC BCD
4 2 B 1/1/2017 2/1/2017
6 3 A 1 2
8 3 C ABC BCD
Upvotes: 2