code_monkey767
code_monkey767

Reputation: 41

Inaccurate value in my Pandas dataframe column after reading an external Excel file

I have read the following file into a Pandas dataframe: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
I've viewed the file before in Excel, and cells contain the string '...' (exactly 3 dots) to represent missing values.

My problem is that after reading the file into a Pandas dataframe called 'energy', some of the missing values are no longer represented with '...' as defined in the Excel document, but rather a series of many more dots, for example: '.................................................'. This makes doing energy.replace('...', np.nan, inplace=True) inaccurate since not all missing values are being replaced.

Could anyone explain why this behavior is occurring, and what is the best way to go about correcting it with Pandas?

This is my code:

import pandas as pd
import numpy as np
import re

# Read excel file
energy = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
                      skiprows = 17,
                      skipfooter = 38)

# Drop the first 2 unnecessary columns
energy.drop(['Unnamed: 0', 'Unnamed: 1'], axis=1, inplace=True)

# Rename the remaining columns
col_names = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
energy.columns = col_names

# Convert energy supply to gigajoules
energy['Energy Supply'] = energy['Energy Supply'] * 1000000

# Replace missing values
energy.replace('...', np.nan, inplace=True)

# Replace country names according to provided to specifications
energy['Country'].replace({
    'Republic of Korea': 'South Korea',
    'China, Hong Kong Special Administrative Region': 'Hong Kong',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'United States of America': 'United States'
}, inplace=True)

energy.head()

The code above results in the following dataframe: DataFrame with unexpected value circled

Upvotes: 2

Views: 262

Answers (3)

Shijith
Shijith

Reputation: 4872

you can use parameters within read_excel

df = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
                  skiprows=17,
                  skipfooter=38,
                  na_values='...',
                  usecols='C:F',
                  names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'])

Upvotes: 0

Shoobi
Shoobi

Reputation: 98

you should place

energy.replace('...', np.nan, inplace=True)

before

energy['Energy Supply'] = energy['Energy Supply'] * 1000000

since your column datatype is object (string) , '...' * 1000000 = ......................

Upvotes: 0

jezrael
jezrael

Reputation: 862661

First solution is use parameter na_values in read_excel:

energy = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
                      skiprows = 17,
                      skipfooter = 38,
                      na_values='...')

Another solution with replace - regex is changed to ^\.+$ for replace only mutiple dots to NaNs:

^ is for start of string
\ for escape dot, because normally the dot sign is used in regexes to match any character
+ is for one ore more dots
$ is for end of string


energy.replace(r'^\.+$', np.nan, inplace=True, regex=True)

Upvotes: 1

Related Questions