Faster way to identify and compare rows based on matching conditions within a dataframe having millions of rows

Question

I have a dataframe as below.

Actual Dataframe

         Date   Fruit level_0    Num    Color
0  2013-11-25   Apple     DF2   22.1      Red
1  2013-11-24  Banana     DF1   22.1   Yellow
2  2013-11-24  Banana     DF2  122.1   Yellow
3  2013-11-23  Celery     DF1   10.2    Green
4  2013-11-24  Orange     DF1    8.6   Orange
5  2013-11-24  Orange     DF2    8.6  Orange1
6  2013-11-25  Orange     DF1    8.6   Orange

I need to find and compare rows within the dataframe and see which columns have mismatched data. Only rows that have the same "Date" and "Fruit" values but different "level_0" values should be compared. So in the dataframe above, I need to compare the rows at index 1 and 2, since they have the same "Date" and "Fruit" values but different "level_0" values. Because these two rows differ in the "Num" column, a label (say "NM") needs to be suffixed to the "Num" value in both rows. Rows whose "Date" and "Fruit" combination occurs only once need a label (say "Miss") suffixed to the value in the "Fruit" column.

Example of expected output below:

Expected Output

1.) Is it possible to get such an output?
2.) Is there a fast way to get it, as my actual dataset contains millions of rows and 20-25 columns?
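
For reference, the rules described above could be sketched with vectorized groupby operations instead of a row-by-row loop. This is only a sketch under some assumptions that the question does not spell out: each "Date"/"Fruit" pair has at most one row per "level_0" value, every column other than "Date", "Fruit" and "level_0" should be compared, and the labels are appended after a space (the exact output format is a guess, and names such as `keys`, `value_cols` and `out` are made up for illustration).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": ["2013-11-25", "2013-11-24", "2013-11-24", "2013-11-23",
             "2013-11-24", "2013-11-24", "2013-11-25"],
    "Fruit": ["Apple", "Banana", "Banana", "Celery",
              "Orange", "Orange", "Orange"],
    "level_0": ["DF2", "DF1", "DF2", "DF1", "DF1", "DF2", "DF1"],
    "Num": [22.1, 22.1, 122.1, 10.2, 8.6, 8.6, 8.6],
    "Color": ["Red", "Yellow", "Yellow", "Green",
              "Orange", "Orange1", "Orange"],
})

keys = ["Date", "Fruit"]
# Assumption: every column except the keys and "level_0" is compared.
value_cols = [c for c in df.columns if c not in keys + ["level_0"]]

grouped = df.groupby(keys, sort=False)
# Number of rows sharing each (Date, Fruit) pair, aligned to the rows.
group_size = grouped["level_0"].transform("size")

out = df.copy()
for col in value_cols:
    # More than one distinct value in a group means a mismatch in this column.
    mismatch = grouped[col].transform("nunique") > 1
    as_text = df[col].astype(str)
    out[col] = as_text.mask(mismatch, as_text + " NM")

# A (Date, Fruit) pair that occurs only once gets the "Miss" label.
out["Fruit"] = df["Fruit"].mask(group_size == 1, df["Fruit"] + " Miss")

print(out)
```

Each compared column costs one grouped `transform`, so the work scales with the number of columns rather than with a Python-level loop over rows. Note that labelled columns become strings, and that `nunique` ignores missing values by default, so a NaN paired with a real value would not be flagged in this sketch.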
