Srinivas
Srinivas

Reputation: 656

Strange behaviour. A single negative value generated in Pandas dataframe, when new column is created

Following is the head of the Pandas dataframethat I am working on.

test.head()

    Country_or_Other    TotalCases  TotalTests
9   USA                    2026493    21725064
10  Brazil                  710887      999836
11  Russia                  476658    13016023
12  Spain                   288797     4465338
13  UK                      287399     5731576

display(test['TotalCases'].sum())
display(test['TotalTests'].sum())

7189858

10038473

I am creating two new columns using simple arithmetic. The columns will calculate the value of individual observations as a % of the totals for the original variables.

test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
test.head()


Country_or_Other    TotalCases  TotalTests  TotalCases_Percent  TotalTests_Percent
9   USA                2026493    21725064               28.19              -21.14
10  Brazil              710887      999836                9.89                1.00
11  Russia              476658    13016023                6.63               12.97
12  Spain               288797     4465338                4.02                4.45
13  UK                  287399     5731576                4.00                5.71

test[test['Country_or_Other'] == 'USA']

    Country_or_Other    TotalCases  TotalTests  TotalCases_Percent  TotalTests_Percent
9   USA                    2026493    21725064               28.19              -21.14

As we can see only the **TotalTests_Percent for USA** has generated a negative value. It should have been a +ve percentage, like all other values. Why is this? Appreciate inputs.

#

Complete relevant code from beginning being run on a new notebook

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.worldometers.info/coronavirus/', headers={'User-Agent': 'Firefox/75.0'})
webpage = re.sub(r'<.*?>', lambda g: g.group(0).upper(), urlopen(req).read().decode('utf-8') )

tables = pd.read_html(webpage)

df = tables[1]
df = df.rename(columns={'Country,Other': 'Country_or_Other','Serious,Critical': 'Serious_or_Critical','Tot\xa0Cases/1M pop':'Cases_per_1M_pop', 'Tests/  1M pop': 'Tests_per_1M_pop','Deaths/1M pop':'Deaths_per_1M_pop','Tests/ 1M pop':'Tests_per_1M_pop'})
df['TotalDeaths'] = df['TotalDeaths'].fillna(0).astype(int)
df['TotalRecovered'] = df['TotalRecovered'].fillna(0).astype(int)
df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)

df1 = df.drop(df.index[0:8]).drop(df.index[-8:])

cum_data = df1.drop(columns=['NewCases','NewDeaths'])
cum_data['Dead_to_Recovered'] = 100*cum_data['TotalDeaths']/cum_data['TotalRecovered']
cum_data = cum_data.sort_values('TotalCases', ascending=False)

test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]

test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
test[test['Country_or_Other'] == 'USA']

    Country_or_Other    TotalCases  TotalTests  TotalCases_Percent  TotalTests_Percent
9   USA                    2026493    21725064               28.19              -21.14

Appreciate if check and let me know.

Upvotes: 1

Views: 219

Answers (2)

Yasser Mohsen
Yasser Mohsen

Reputation: 1480

Solution 1:

Change TotalTests column dtype to int64.

Use:

df['TotalTests'] = df['TotalTests'].fillna(0).astype('int64')   ## int64

instead of:

df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)       ## int32

Why?

Mathematically, after multiplying the TotalTests value for USA by 100, its value should be 2 172 506 400 which is larger than int32 maximum value 2 147 483 648. So, it behaved weirdly. Changing its type to int64 provides much higher maximum value.

Generally, it is preferred to use int64 in all columns that have large values and about to reach int32 max value.

Solution 2 (naive):

Multiply by 100 after division to avoid exceeding the maximum value at any point:

test['TotalTests_Percent'] = (test['TotalTests']/test['TotalTests'].sum())*100

Upvotes: 1

jezrael
jezrael

Reputation: 863291

Pandas here convert column to int32 instead int64, so after multiple by 100 output is wrong:

df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)

Then get int32 and after multiple by 100 get negative values:

test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]

print (test.dtypes)
Country_or_Other    object
TotalCases           int64
TotalTests           int32
dtype: object


test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']
df = test[test['Country_or_Other'] == 'USA']

print (df)
  Country_or_Other  TotalCases  TotalTests  TotalCases_Percent  \
9              USA     2026493    21725064            28.18313   

   TotalTests_Percent  
9         -2122460896  

Solution is convert to np.int64:

df['TotalDeaths'] = df['TotalDeaths'].fillna(0).astype(np.int64)
df['TotalRecovered'] = df['TotalRecovered'].fillna(0).astype(np.int64)
df['TotalTests'] = df['TotalTests'].fillna(0).astype(np.int64)


test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]

print (test.dtypes)
Country_or_Other    object
TotalCases           int64
TotalTests           int64
dtype: object

test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
df = test[test['Country_or_Other'] == 'USA']

print (df)
  Country_or_Other  TotalCases  TotalTests  TotalCases_Percent  \
9              USA     2026493    21725064            28.18313   

   TotalTests_Percent  
9           21.641801  

Upvotes: 1

Related Questions