Reputation: 21
I have a dataframe as below.
Date Fruit level_0 Num Color
0 2013-11-25 Apple DF2 22.1 Red
1 2013-11-24 Banana DF1 22.1 Yellow
2 2013-11-24 Banana DF2 122.1 Yellow
3 2013-11-23 Celery DF1 10.2 Green
4 2013-11-24 Orange DF1 8.6 Orange
5 2013-11-24 Orange DF2 8.6 Orange1
6 2013-11-25 Orange DF1 8.6 Orange
I need to find and compare the rows within the dataframe and see which columns have data mismatch. The rows that are selected for comparison should be only those which have the same "Date" and "Fruit" values but different "level_0" values. So in the dataframe i need to compare rows having index 1 and 2 since they have same value for "Date" & "Fruit", but different "level_0" values. When comparing these since they differ in the "Num" column, we need to suffix a label(say "NM" ) beside the value in both rows. Rows which have only one occurrence of "Date" & "Fruit" combination will need to have a label (say "Miss") suffixed to the value in "Fruit" column.
Example of expected output below:
1.)Is it possible to get such an output? 2.)Is there a fast way get it, as my actual dataset contains millions of rows and 20-25 columns?
Upvotes: 1
Views: 123
Reputation: 1766
This is pretty complex, since there are lot different filters you want to do. If I get you right, you want
-NM
. From your example you want to do the same with the "Color"-column.-Miss
.First, you'll need to make Num
a string column, since we are adding suffixes. Then we groupby Date
and Fruit
(1). Then, since you wanted the groups to have different level_0
values, we make filter on that called diff_frames
(2). Then we add the suffixes using transform on both columns if they have two unique elements (3).
df['Num'] = df['Num'].astype(str)
g = df.groupby(['Date', 'Fruit'])
diff_frames = g['level_0'].transform(lambda s: s.nunique() == 2)
df[['Num', 'Color']] = df[diff_frames].groupby(['Date', 'Fruit'])[['Num', 'Color']].transform(
lambda s: s+'-NM' if s.nunique() == 2 else s)
Then, for the second part, we get the non-duplicated rows in Date
and Fruit
, and add -Miss
to the Fruit
column. (4)
df.loc[~df.duplicated(subset=['Date', 'Fruit'], keep=False), 'Fruit'] += '-Miss'
print(df)
Date Fruit level_0 Num Color
0 0 Apple-Miss DF2 22.1 Red
1 1 Banana DF1 22.1-NM Yellow
2 1 Banana DF2 122.1-NM Yellow
3 2 Celery-Miss DF1 10.2 Green
4 3 Orange DF1 8.6 Orange-NM
5 3 Orange DF2 8.6 Orange1-NM
6 4 Orange-Miss DF2 8.6 Orange
Upvotes: 1