asha sasidharan
asha sasidharan

Reputation: 21

Faster way to identify and compare rows based on matching conditions within a dataframe having millions of rows

I have a dataframe as below.

Actual Dataframe

         Date   Fruit level_0    Num    Color
0  2013-11-25   Apple     DF2   22.1      Red
1  2013-11-24  Banana     DF1   22.1   Yellow
2  2013-11-24  Banana     DF2  122.1   Yellow
3  2013-11-23  Celery     DF1   10.2    Green
4  2013-11-24  Orange     DF1    8.6   Orange
5  2013-11-24  Orange     DF2    8.6  Orange1
6  2013-11-25  Orange     DF1    8.6   Orange

I need to find and compare the rows within the dataframe and see which columns have data mismatch. The rows that are selected for comparison should be only those which have the same "Date" and "Fruit" values but different "level_0" values. So in the dataframe i need to compare rows having index 1 and 2 since they have same value for "Date" & "Fruit", but different "level_0" values. When comparing these since they differ in the "Num" column, we need to suffix a label(say "NM" ) beside the value in both rows. Rows which have only one occurrence of "Date" & "Fruit" combination will need to have a label (say "Miss") suffixed to the value in "Fruit" column.

Example of expected output below:

Expected Output

1.)Is it possible to get such an output? 2.)Is there a fast way get it, as my actual dataset contains millions of rows and 20-25 columns?

Upvotes: 1

Views: 123

Answers (1)

Bertil Johannes Ipsen
Bertil Johannes Ipsen

Reputation: 1766

This is pretty complex, since there are lot different filters you want to do. If I get you right, you want

  1. for rows that have the same "Date" and "Fruit" values, and
  2. of those rows, those that have different "level_0" values, and
  3. of those rows, those that have different "Num" values to get -NM. From your example you want to do the same with the "Color"-column.
  4. Rows that are the only occurence of a "Date" and "Fruit" value get -Miss.

First, you'll need to make Num a string column, since we are adding suffixes. Then we groupby Date and Fruit (1). Then, since you wanted the groups to have different level_0 values, we make filter on that called diff_frames (2). Then we add the suffixes using transform on both columns if they have two unique elements (3).

df['Num'] = df['Num'].astype(str)
g = df.groupby(['Date', 'Fruit'])
diff_frames = g['level_0'].transform(lambda s: s.nunique() == 2)
df[['Num', 'Color']] = df[diff_frames].groupby(['Date', 'Fruit'])[['Num', 'Color']].transform(
    lambda s: s+'-NM' if s.nunique() == 2 else s)

Then, for the second part, we get the non-duplicated rows in Date and Fruit, and add -Miss to the Fruit column. (4)

df.loc[~df.duplicated(subset=['Date', 'Fruit'], keep=False), 'Fruit'] += '-Miss'
print(df)


   Date        Fruit level_0       Num       Color
0     0   Apple-Miss     DF2      22.1         Red
1     1       Banana     DF1   22.1-NM      Yellow
2     1       Banana     DF2  122.1-NM      Yellow
3     2  Celery-Miss     DF1      10.2       Green
4     3       Orange     DF1       8.6   Orange-NM
5     3       Orange     DF2       8.6  Orange1-NM
6     4  Orange-Miss     DF2       8.6      Orange

Upvotes: 1

Related Questions