Reputation: 6382
I have a dataframe where the 'location' column contains an object:
import pandas as pd
item1 = {
    'project': 'A',
    'location': {'country': 'united states', 'city': 'new york'},
    'raised_usd': 1.0}
item2 = {
    'project': 'B',
    'location': {'country': 'united kingdom', 'city': 'cambridge'},
    'raised_usd': 5.0}
item3 = {
    'project': 'C',
    'raised_usd': 10.0}
data = [item1, item2, item3]
df = pd.DataFrame(list(data))
df
I'd like to create an extra column, 'project_country', which contains just the country information, if available. I've tried the following:
def get_country(location):
    try:
        return location['country']
    except Exception:
        return 'n/a'
df['project_country'] = get_country(df['location'])
df
But this doesn't work: every row ends up with 'n/a'.
How should I go about importing this field?
Upvotes: 4
Views: 10966
Reputation: 808
When reading a CSV file, you can use the converters option:
import json
import pandas as pd

def string_to_dict(dict_string):
    # Parse the JSON text in each cell; fall back to "N/A" if it cannot be parsed
    try:
        return json.loads(dict_string)
    except Exception:
        return "N/A"

df = pd.read_csv('../data/data.csv', converters={'locations': string_to_dict})
Then access the data with json_normalize:
from pandas import json_normalize
normalized_locations = json_normalize(df['locations'])
df['country'] = normalized_locations['country']
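As a minimal, self-contained sketch of the converter approach above: the CSV here is built in memory (an assumption, standing in for '../data/data.csv'), the cells are assumed to hold valid JSON, and the fallback is an empty dict rather than "N/A" so that json_normalize receives only dicts.

```python
import io
import json

import pandas as pd

# Build a small CSV in memory; the 'locations' cells hold JSON text (assumption).
raw = pd.DataFrame({
    'project': ['A', 'B', 'C'],
    'locations': [
        json.dumps({'country': 'united states', 'city': 'new york'}),
        json.dumps({'country': 'united kingdom', 'city': 'cambridge'}),
        None,  # written as an empty cell
    ],
})
csv_text = raw.to_csv(index=False)


def string_to_dict(dict_string):
    # Empty or malformed cells become an empty dict instead of raising.
    try:
        return json.loads(dict_string)
    except ValueError:
        return {}


df = pd.read_csv(io.StringIO(csv_text), converters={'locations': string_to_dict})

# Expand the dicts into columns; rows built from {} come out as NaN.
normalized_locations = pd.json_normalize(df['locations'].tolist())
df['country'] = normalized_locations['country']
```

Note that a column written out by pandas itself usually contains Python dict reprs with single quotes, which json.loads rejects; in that case a different parser would be needed.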
Upvotes: 0
Reputation: 31
Another way to do it is to use .str[<key>], which implicitly calls __getitem__ with the key argument on each item:
In [17]: df['location'].str['country']
Out[17]:
0 united states
1 united kingdom
2 NaN
Name: location, dtype: object
It returns NaN for items where the lookup fails and the value otherwise.
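A short sketch of how this could be used to build the column from the question, assuming the same three-row frame; the fillna('n/a') step is an optional addition to match the placeholder used in the question.

```python
import pandas as pd

df = pd.DataFrame([
    {'project': 'A', 'location': {'country': 'united states', 'city': 'new york'}},
    {'project': 'B', 'location': {'country': 'united kingdom', 'city': 'cambridge'}},
    {'project': 'C'},  # no location, so this cell is NaN
])

# .str[...] looks up the key in each dict and leaves NaN where there is no dict.
country = df['location'].str['country']

# Replace the missing entries with the placeholder from the question.
df['project_country'] = country.fillna('n/a')
```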
Upvotes: 2
Reputation: 33960
With apply, you can use operator.itemgetter. Note we need to use dropna() since your column contains NaN:
from operator import itemgetter
# Without dropna(), this raises a TypeError on the NaN row:
# df['location'].apply(itemgetter('country'))
df['location'].dropna().apply(itemgetter('country'))
0 united states
1 united kingdom
Name: location, dtype: object
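The result above has only two rows, but it can still be assigned back to the frame, because pandas aligns on the index. A minimal sketch, assuming the frame from the question:

```python
from operator import itemgetter

import pandas as pd

df = pd.DataFrame([
    {'project': 'A', 'location': {'country': 'united states', 'city': 'new york'}},
    {'project': 'B', 'location': {'country': 'united kingdom', 'city': 'cambridge'}},
    {'project': 'C'},  # no location, so this cell is NaN
])

# Only rows 0 and 1 survive dropna(), so itemgetter never sees the NaN.
countries = df['location'].dropna().apply(itemgetter('country'))

# Assignment aligns on the index: row 2 has no match and is filled with NaN.
df['project_country'] = countries
```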
Upvotes: 0
Reputation: 16154
The correct way, as EdChum pointed out, is to use apply on the 'location' column. You could compress that code into one line:
In [15]: df['location'].apply(lambda v: v.get('country') if isinstance(v, dict) else '')
Out[15]:
0 united states
1 united kingdom
2
Name: location, dtype: object
Then assign it to a column:
In [16]: df['country'] = df['location'].apply(lambda v: v.get('country') if isinstance(v, dict) else '')
In [17]: df
Out[17]:
location project raised_usd \
0 {u'country': u'united states', u'city': u'new ... A 1
1 {u'country': u'united kingdom', u'city': u'cam... B 5
2 NaN C 10
country
0 united states
1 united kingdom
2
Upvotes: 1
Reputation: 394159
Use apply and pass your func to it:
In [62]:
def get_country(location):
    try:
        return location['country']
    except Exception:
        return 'n/a'
df['project_country'] = df['location'].apply(get_country)
df
Out[62]:
location project raised_usd \
0 {'country': 'united states', 'city': 'new york'} A 1
1 {'country': 'united kingdom', 'city': 'cambrid... B 5
2 NaN C 10
project_country
0 united states
1 united kingdom
2 n/a
Your original code failed because the function was passed the entire column, a pandas Series, rather than each element in turn:
In [64]:
def get_country(location):
    print(location)
    try:
        print(location['country'])
    except Exception:
        print('n/a')
get_country(df['location'])
0 {'country': 'united states', 'city': 'new york'}
1 {'country': 'united kingdom', 'city': 'cambrid...
2 NaN
Name: location, dtype: object
n/a
As such, an attempt to look up the key on the entire Series raises a KeyError, so the function returns the single string 'n/a', which is then assigned to every row.
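The difference between the two calls can be sketched side by side, assuming the frame from the question; the 'broadcast' column name is only for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {'project': 'A', 'location': {'country': 'united states', 'city': 'new york'}},
    {'project': 'B', 'location': {'country': 'united kingdom', 'city': 'cambridge'}},
    {'project': 'C'},  # no location, so this cell is NaN
])


def get_country(location):
    try:
        return location['country']
    except Exception:
        return 'n/a'


# Called on the whole Series: 'country' is looked up in the index (0, 1, 2),
# which raises KeyError, so the function returns the single string 'n/a'.
whole_column = get_country(df['location'])

# Assigning a scalar broadcasts it to every row.
df['broadcast'] = whole_column

# apply calls the function once per element, so each dict is indexed on its own.
df['project_country'] = df['location'].apply(get_country)
```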
Upvotes: 4