Reputation: 656
Following is the head of the Pandas dataframe
that I am working on.
test.head()
Country_or_Other TotalCases TotalTests
9 USA 2026493 21725064
10 Brazil 710887 999836
11 Russia 476658 13016023
12 Spain 288797 4465338
13 UK 287399 5731576
display(test['TotalCases'].sum())
display(test['TotalTests'].sum())
7189858
10038473
I am creating two new columns using simple arithmetic. The columns will calculate the value of individual observations as a % of the totals for the original variables.
test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
test.head()
Country_or_Other TotalCases TotalTests TotalCases_Percent TotalTests_Percent
9 USA 2026493 21725064 28.19 -21.14
10 Brazil 710887 999836 9.89 1.00
11 Russia 476658 13016023 6.63 12.97
12 Spain 288797 4465338 4.02 4.45
13 UK 287399 5731576 4.00 5.71
test[test['Country_or_Other'] == 'USA']
Country_or_Other TotalCases TotalTests TotalCases_Percent TotalTests_Percent
9 USA 2026493 21725064 28.19 -21.14
As we can see only the **TotalTests_Percent for USA**
has generated a negative value. It should have been a +ve percentage, like all other values. Why is this? Appreciate inputs.
Complete relevant code from beginning being run on a new notebook
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
req = Request('https://www.worldometers.info/coronavirus/', headers={'User-Agent': 'Firefox/75.0'})
webpage = re.sub(r'<.*?>', lambda g: g.group(0).upper(), urlopen(req).read().decode('utf-8') )
tables = pd.read_html(webpage)
df = tables[1]
df = df.rename(columns={'Country,Other': 'Country_or_Other','Serious,Critical': 'Serious_or_Critical','Tot\xa0Cases/1M pop':'Cases_per_1M_pop', 'Tests/ 1M pop': 'Tests_per_1M_pop','Deaths/1M pop':'Deaths_per_1M_pop','Tests/ 1M pop':'Tests_per_1M_pop'})
df['TotalDeaths'] = df['TotalDeaths'].fillna(0).astype(int)
df['TotalRecovered'] = df['TotalRecovered'].fillna(0).astype(int)
df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)
df1 = df.drop(df.index[0:8]).drop(df.index[-8:])
cum_data = df1.drop(columns=['NewCases','NewDeaths'])
cum_data['Dead_to_Recovered'] = 100*cum_data['TotalDeaths']/cum_data['TotalRecovered']
cum_data = cum_data.sort_values('TotalCases', ascending=False)
test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]
test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
test[test['Country_or_Other'] == 'USA']
Country_or_Other TotalCases TotalTests TotalCases_Percent TotalTests_Percent
9 USA 2026493 21725064 28.19 -21.14
Appreciate if check and let me know.
Upvotes: 1
Views: 219
Reputation: 1480
Solution 1:
Change TotalTests
column dtype to int64
.
Use:
df['TotalTests'] = df['TotalTests'].fillna(0).astype('int64') ## int64
instead of:
df['TotalTests'] = df['TotalTests'].fillna(0).astype(int) ## int32
Why?
Mathematically, after multiplying the TotalTests
value for USA
by 100, its value should be 2 172 506 400
which is larger than int32
maximum value 2 147 483 648
. So, it behaved weirdly. Changing its type to int64
provides much higher maximum value.
Generally, it is preferred to use int64
in all columns that have large values and about to reach int32
max value.
Solution 2 (naive):
Multiply by 100 after division to avoid exceeding the maximum value at any point:
test['TotalTests_Percent'] = (test['TotalTests']/test['TotalTests'].sum())*100
Upvotes: 1
Reputation: 863291
Pandas here convert column to int32
instead int64
, so after multiple by 100
output is wrong:
df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)
Then get int32
and after multiple by 100
get negative values:
test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]
print (test.dtypes)
Country_or_Other object
TotalCases int64
TotalTests int32
dtype: object
test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']
df = test[test['Country_or_Other'] == 'USA']
print (df)
Country_or_Other TotalCases TotalTests TotalCases_Percent \
9 USA 2026493 21725064 28.18313
TotalTests_Percent
9 -2122460896
Solution is convert to np.int64
:
df['TotalDeaths'] = df['TotalDeaths'].fillna(0).astype(np.int64)
df['TotalRecovered'] = df['TotalRecovered'].fillna(0).astype(np.int64)
df['TotalTests'] = df['TotalTests'].fillna(0).astype(np.int64)
test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]
print (test.dtypes)
Country_or_Other object
TotalCases int64
TotalTests int64
dtype: object
test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
df = test[test['Country_or_Other'] == 'USA']
print (df)
Country_or_Other TotalCases TotalTests TotalCases_Percent \
9 USA 2026493 21725064 28.18313
TotalTests_Percent
9 21.641801
Upvotes: 1