Reputation: 1200
I have two identical dataframes new
and old
. The new
dataframe will be updated randomly throughout the day. The code below checks if there are any changes.
import pandas as pd
import numpy as np
new = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard'],
'episodes': [42, 24, 31, 29, 37, 40],
'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
old = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard'],
'episodes': [12, 32, 31, 32, 37, 40],
'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
df = pd.DataFrame(old, columns = ['name','episodes', 'gender'])
while True:
df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
print(df[~df.episodes.eq(df1.episodes)])
df1 = df
I need to write condition inside a while
loop where df[~df.episodes.eq(df1.episodes)]
is only printed if there is a change detected. After the new data is printed it would set both the dataframe to same values(because old data is not longer needed) and recheck for changes. The above code would print :
Columns: [name, episodes, gender]
Index: []
Empty DataFrame
Columns: [name, episodes, gender]
Index: []
Empty DataFrame
Columns: [name, episodes, gender]
Index: []
Empty DataFrame
Thus one would miss if a change has actually been printed. Could you please advise a more efficient way of accomplishing this.
== Edit ==
As per the anwer by @BENY if i do this:
import pandas as pd
import numpy as np
new = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Sheldon'],
'episodes': [42, 24, 31, 29, 37, 40],
'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
old = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Sheldon'],
'episodes': [12, 32, 31, 32, 37, 40],
'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
df = pd.DataFrame(old, columns = ['name','episodes', 'gender'])
while True:
df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
out = df.merge(df1[['name','episodes']],on=['name','episodes'],how='left',indicator=True).loc[lambda x : x['_merge']=='left_only']
print(out)
df = df1
It would print this out for the entire whileloop:
name episodes gender _merge
0 Sheldon 12 male left_only
1 Penny 32 female left_only
3 Bernadette 32 female left_only
name episodes gender _merge
0 Sheldon 12 male left_only
1 Penny 32 female left_only
3 Bernadette 32 female left_only
name episodes gender _merge
0 Sheldon 12 male left_only
1 Penny 32 female left_only
3 Bernadette 32 female left_only
Is there any way it would print it only once. Until there is another change. If i df= df1
than it would print the as below and I would miss the change:
Columns: [name, episodes, gender, _merge]
Index: []
Empty DataFrame
Columns: [name, episodes, gender, _merge]
I need to get this data cleanly where the changes are detected.
Upvotes: 1
Views: 224
Reputation: 23227
If you want to compare the 2 dataframe and check for any changes/differences, why not use the DataFrame.compare()
function ?
Here's the sample output based on your sample data:
df.compare(df1)
Output:
episodes
self other
0 12.0 42.0
1 32.0 24.0
3 32.0 29.0
By default, it highlight only the differences. Here, it shows that only the column episodes
has differences.
self
corresponds to the values of df
and other
corresponds to the values df1
The indexes on the left, ie. 0
, 1
, and 3
show the row indexes with differences.
If you want to show the whole original shape, you can also use the keep_shape=
parameter, as follows:
df.compare(df1, keep_shape=True)
Output:
name episodes gender
self other self other self other
0 NaN NaN 12.0 42.0 NaN NaN
1 NaN NaN 32.0 24.0 NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN 32.0 29.0 NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
Only different values are shown. NaN
values are those without differences.
Of course, if you like, you can also choose to display all values, including equal values, as follows:
df.compare(df1, keep_shape=True, keep_equal=True)
Output
name episodes gender
self other self other self other
0 Sheldon Sheldon 12 42 male male
1 Penny Penny 32 24 female female
2 Amy Amy 31 31 female female
3 Bernadette Bernadette 32 29 female female
4 Raj Raj 37 37 male male
5 Howard Howard 40 40 male male
This options allow you side by side comparison to inspect the difference. Anyway, it is less easy to spot the differences.
I would suggest you take the default options to display only the differences first (may be write down the indexes of rows with differences) and optionally, use the other 2 options only when you want detailed inspection of the other side values (which are equal).
To use under your while
loop, you can use:
while True:
df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
out = df.compare(df1)
print(out)
df = df1
If you want to see the name
while maintaining to see only the differences of other columns, you can set index with append=True
, as follows:
df.set_index('name', append=True).compare(df1.set_index('name', append=True))
Output
episodes
self other
name
0 Sheldon 12.0 42.0
1 Penny 32.0 24.0
3 Bernadette 32.0 29.0
This way, you can see both the name
and the row indexes with differences.
Upvotes: 1
Reputation: 323356
Let us try with merge
out = df.merge(df1[['name','episodes']],on=['name','episodes'],how='left',indicator=True).loc[lambda x : x['_merge']=='left_only']
name episodes gender _merge
0 Sheldon 12 male left_only
1 Penny 32 female left_only
3 Bernadette 32 female left_only
Upvotes: 1