Reputation: 41
I have read the following file into a Pandas dataframe: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
I've viewed the file before in Excel, and cells contain the string '...' (exactly 3 dots) to represent missing values.
My problem is that after reading the file into a Pandas dataframe called 'energy', some of the missing values are no longer represented with '...' as defined in the Excel document, but rather a series of many more dots, for example: '.................................................'. This makes doing energy.replace('...', np.nan, inplace=True)
inaccurate since not all missing values are being replaced.
Could anyone explain why this behavior is occurring, and what is the best way to go about correcting it with Pandas?
This is my code:
import pandas as pd
import numpy as np
import re
# Read excel file
energy = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
skiprows = 17,
skipfooter = 38)
# Drop the first 2 unnecessary columns
energy.drop(['Unnamed: 0', 'Unnamed: 1'], axis=1, inplace=True)
# Rename the remaining columns
col_names = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
energy.columns = col_names
# Convert energy supply to gigajoules
energy['Energy Supply'] = energy['Energy Supply'] * 1000000
# Replace missing values
energy.replace('...', np.nan, inplace=True)
# Replace country names according to provided to specifications
energy['Country'].replace({
'Republic of Korea': 'South Korea',
'China, Hong Kong Special Administrative Region': 'Hong Kong',
'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
'United States of America': 'United States'
}, inplace=True)
energy.head()
The code above results in the following dataframe: DataFrame with unexpected value circled
Upvotes: 2
Views: 262
Reputation: 4872
you can use parameters within read_excel
df = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
skiprows=17,
skipfooter=38,
na_values='...',
usecols='C:F',
names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'])
Upvotes: 0
Reputation: 98
you should place
energy.replace('...', np.nan, inplace=True)
before
energy['Energy Supply'] = energy['Energy Supply'] * 1000000
since your column datatype is object (string) , '...' * 1000000 = ......................
Upvotes: 0
Reputation: 862661
First solution is use parameter na_values
in read_excel
:
energy = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls',
skiprows = 17,
skipfooter = 38,
na_values='...')
Another solution with replace
- regex is changed to ^\.+$
for replace only mutiple dots to NaN
s:
^
is for start of string
\
for escape dot, because normally the dot sign is used in regexes to match any character
+
is for one ore more dots
$
is for end of string
energy.replace(r'^\.+$', np.nan, inplace=True, regex=True)
Upvotes: 1