Slartibartfast
Slartibartfast

Reputation: 1200

while loop to constantly recheck changes in Pandas dataframe

I have two identical dataframes new and old. The new dataframe will be updated randomly throughout the day. The code below checks if there are any changes.

import pandas as pd
import numpy as np

new = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard'],
                'episodes': [42, 24, 31, 29, 37, 40],
                'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
old = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard'],
                'episodes': [12, 32, 31, 32, 37, 40],
                'gender': ['male', 'female', 'female', 'female', 'male', 'male']}    

df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])    
df = pd.DataFrame(old, columns = ['name','episodes', 'gender'])

while True:
    df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])    
    print(df[~df.episodes.eq(df1.episodes)])
    df1 = df

I need to write condition inside a while loop where df[~df.episodes.eq(df1.episodes)] is only printed if there is a change detected. After the new data is printed it would set both the dataframe to same values(because old data is not longer needed) and recheck for changes. The above code would print :

Columns: [name, episodes, gender]
Index: []
Empty DataFrame
Columns: [name, episodes, gender]
Index: []
Empty DataFrame
Columns: [name, episodes, gender]
Index: []
Empty DataFrame

Thus one would miss if a change has actually been printed. Could you please advise a more efficient way of accomplishing this.

== Edit ==

As per the anwer by @BENY if i do this:

import pandas as pd
import numpy as np

new = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Sheldon'],
                'episodes': [42, 24, 31, 29, 37, 40],
                'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
old = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Sheldon'],
                'episodes': [12, 32, 31, 32, 37, 40],
                'gender': ['male', 'female', 'female', 'female', 'male', 'male']}    

df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])    
df = pd.DataFrame(old, columns = ['name','episodes', 'gender'])

while True:
    df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])    
    out = df.merge(df1[['name','episodes']],on=['name','episodes'],how='left',indicator=True).loc[lambda x : x['_merge']=='left_only']
    print(out)
    df = df1

It would print this out for the entire whileloop:

         name  episodes  gender     _merge
0     Sheldon        12    male  left_only
1       Penny        32  female  left_only
3  Bernadette        32  female  left_only
         name  episodes  gender     _merge
0     Sheldon        12    male  left_only
1       Penny        32  female  left_only
3  Bernadette        32  female  left_only
         name  episodes  gender     _merge
0     Sheldon        12    male  left_only
1       Penny        32  female  left_only
3  Bernadette        32  female  left_only

Is there any way it would print it only once. Until there is another change. If i df= df1 than it would print the as below and I would miss the change:

Columns: [name, episodes, gender, _merge]
Index: []
Empty DataFrame
Columns: [name, episodes, gender, _merge]

I need to get this data cleanly where the changes are detected.

Upvotes: 1

Views: 224

Answers (2)

SeaBean
SeaBean

Reputation: 23227

If you want to compare the 2 dataframe and check for any changes/differences, why not use the DataFrame.compare() function ?

Here's the sample output based on your sample data:

df.compare(df1)

Output:

    episodes      
    self other
0   12.0  42.0
1   32.0  24.0
3   32.0  29.0

By default, it highlight only the differences. Here, it shows that only the column episodes has differences.
self corresponds to the values of df and other corresponds to the values df1

The indexes on the left, ie. 0, 1, and 3 show the row indexes with differences.

If you want to show the whole original shape, you can also use the keep_shape= parameter, as follows:

df.compare(df1, keep_shape=True)

Output:

  name       episodes       gender      
  self other     self other   self other
0  NaN   NaN     12.0  42.0    NaN   NaN
1  NaN   NaN     32.0  24.0    NaN   NaN
2  NaN   NaN      NaN   NaN    NaN   NaN
3  NaN   NaN     32.0  29.0    NaN   NaN
4  NaN   NaN      NaN   NaN    NaN   NaN
5  NaN   NaN      NaN   NaN    NaN   NaN

Only different values are shown. NaN values are those without differences.

Of course, if you like, you can also choose to display all values, including equal values, as follows:

df.compare(df1, keep_shape=True, keep_equal=True)

Output

         name             episodes        gender        
         self       other     self other    self   other
0     Sheldon     Sheldon       12    42    male    male
1       Penny       Penny       32    24  female  female
2         Amy         Amy       31    31  female  female
3  Bernadette  Bernadette       32    29  female  female
4         Raj         Raj       37    37    male    male
5      Howard      Howard       40    40    male    male

This options allow you side by side comparison to inspect the difference. Anyway, it is less easy to spot the differences.

I would suggest you take the default options to display only the differences first (may be write down the indexes of rows with differences) and optionally, use the other 2 options only when you want detailed inspection of the other side values (which are equal).

To use under your while loop, you can use:

while True:
    df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])    
    out = df.compare(df1)
    print(out)
    df = df1

Edit

If you want to see the name while maintaining to see only the differences of other columns, you can set index with append=True, as follows:

df.set_index('name', append=True).compare(df1.set_index('name', append=True))

Output

                 episodes      
                 self other
  name                     
0 Sheldon        12.0  42.0
1 Penny          32.0  24.0
3 Bernadette     32.0  29.0

This way, you can see both the name and the row indexes with differences.

Upvotes: 1

BENY
BENY

Reputation: 323356

Let us try with merge

out = df.merge(df1[['name','episodes']],on=['name','episodes'],how='left',indicator=True).loc[lambda x : x['_merge']=='left_only']
         name  episodes  gender     _merge
0     Sheldon        12    male  left_only
1       Penny        32  female  left_only
3  Bernadette        32  female  left_only

Upvotes: 1

Related Questions