Reputation: 79
I have a DataFrame to find the percentage of absolute difference between two sources in python. But when I use the below code, few of the columns are giving the -% (minus percentages)
I have checked the columns showing minus percentage datatypes are same in both the sources.
Can anyone help me to find out why?
# Define the columns you want to process
columns = ['a', 'b', 'c', 'd']
# Create the results DataFrame
results_df = pd.DataFrame()
results_df['date'] = c_df['date']
results_df['id'] = c_df['id']
for col in columns:
# calculating the absolute difference
diff = np.abs(c_df[f'{col}_s1'] - c_df[f'{col}_s2'])
# calculating mean for snowflake columns
norm_diff = c_df[f'{col}_s1'].mean()
# Avoid division by zero
if norm_diff == 0:
percentage = np.zeros(len(diff))
else:
# calculating percentage difference b/w the columns in both datasets
percentage = (diff / norm_diff) * 100
# Round the percentage_difference and add % symbol
results_df[col] = [f"{round(diff)}%" for diff in percentage]
display(results_df)
And the results are something like this.
a | b | c | d |
---|---|---|---|
0% | 5% | 10% | -5% |
10% | 100% | -50% | 200% |
Upvotes: 1
Views: 74
Reputation: 1987
Can't be sure without seeing your data, but it could be the way you're calculating norm_diff. When you run norm_diff = c_df[f'{col}_s1'].mean()
, you'll end up with negative values if the average for the column is negative. That would give you a negative value when you calculate percentage
later.
Try logging the value of norm_diff
and percentage
after each loop; if this is what's happening, percentage
will be negative exactly when norm_diff
is.
Depending on what you're looking for, you might be able to fix this just by taking the absolute value when you calculate norm_diff
:
- norm_diff = c_df[f'{col}_s1'].mean()
+ norm_diff = abs(c_df[f'{col}_s1'].mean())
Upvotes: 1