Strange behaviour. A single negative value generated in Pandas dataframe, when new column is created

Question

Following is the head of the Pandas dataframethat I am working on.

test.head()

    Country_or_Other    TotalCases  TotalTests
9   USA                    2026493    21725064
10  Brazil                  710887      999836
11  Russia                  476658    13016023
12  Spain                   288797     4465338
13  UK                      287399     5731576

display(test['TotalCases'].sum())
display(test['TotalTests'].sum())

7189858

10038473

I am creating two new columns using simple arithmetic. The columns will calculate the value of individual observations as a % of the totals for the original variables.

test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
test.head()


Country_or_Other    TotalCases  TotalTests  TotalCases_Percent  TotalTests_Percent
9   USA                2026493    21725064               28.19              -21.14
10  Brazil              710887      999836                9.89                1.00
11  Russia              476658    13016023                6.63               12.97
12  Spain               288797     4465338                4.02                4.45
13  UK                  287399     5731576                4.00                5.71

test[test['Country_or_Other'] == 'USA']

    Country_or_Other    TotalCases  TotalTests  TotalCases_Percent  TotalTests_Percent
9   USA                    2026493    21725064               28.19              -21.14

As we can see only the **TotalTests_Percent for USA** has generated a negative value. It should have been a +ve percentage, like all other values. Why is this? Appreciate inputs.

#

Complete relevant code from beginning being run on a new notebook

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.worldometers.info/coronavirus/', headers={'User-Agent': 'Firefox/75.0'})
webpage = re.sub(r'<.*?>', lambda g: g.group(0).upper(), urlopen(req).read().decode('utf-8') )

tables = pd.read_html(webpage)

df = tables[1]
df = df.rename(columns={'Country,Other': 'Country_or_Other','Serious,Critical': 'Serious_or_Critical','Tot\xa0Cases/1M pop':'Cases_per_1M_pop', 'Tests/  1M pop': 'Tests_per_1M_pop','Deaths/1M pop':'Deaths_per_1M_pop','Tests/ 1M pop':'Tests_per_1M_pop'})
df['TotalDeaths'] = df['TotalDeaths'].fillna(0).astype(int)
df['TotalRecovered'] = df['TotalRecovered'].fillna(0).astype(int)
df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)

df1 = df.drop(df.index[0:8]).drop(df.index[-8:])

cum_data = df1.drop(columns=['NewCases','NewDeaths'])
cum_data['Dead_to_Recovered'] = 100*cum_data['TotalDeaths']/cum_data['TotalRecovered']
cum_data = cum_data.sort_values('TotalCases', ascending=False)

test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]

test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
test[test['Country_or_Other'] == 'USA']

    Country_or_Other    TotalCases  TotalTests  TotalCases_Percent  TotalTests_Percent
9   USA                    2026493    21725064               28.19              -21.14

Appreciate if check and let me know.

jezrael · Accepted Answer

Pandas here convert column to int32 instead int64, so after multiple by 100 output is wrong:

df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)

Then get int32 and after multiple by 100 get negative values:

test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]

print (test.dtypes)
Country_or_Other    object
TotalCases           int64
TotalTests           int32
dtype: object


test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']
df = test[test['Country_or_Other'] == 'USA']

print (df)
  Country_or_Other  TotalCases  TotalTests  TotalCases_Percent  \
9              USA     2026493    21725064            28.18313   

   TotalTests_Percent  
9         -2122460896

Solution is convert to np.int64:

df['TotalDeaths'] = df['TotalDeaths'].fillna(0).astype(np.int64)
df['TotalRecovered'] = df['TotalRecovered'].fillna(0).astype(np.int64)
df['TotalTests'] = df['TotalTests'].fillna(0).astype(np.int64)


test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]

print (test.dtypes)
Country_or_Other    object
TotalCases           int64
TotalTests           int64
dtype: object

test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
df = test[test['Country_or_Other'] == 'USA']

print (df)
  Country_or_Other  TotalCases  TotalTests  TotalCases_Percent  \
9              USA     2026493    21725064            28.18313   

   TotalTests_Percent  
9           21.641801

Strange behaviour. A single negative value generated in Pandas dataframe, when new column is created

Answers (2)

Related Questions