Rojj
Rojj

Reputation: 1210

Pandas Dataframe index from nested values in dictionary

I am creating a pandas dataframe from historical weather data downloaded from weather underground.

import json
import requests
import pandas as pd
import numpy as np
import datetime
from dateutil.parser import parse
address = "http://api.wunderground.com/api/7036740167876b59/history_20060405/q/CA/San_Francisco.json"
r = requests.get(address)
wu_data = r.json()

Because I do not need all the data I only use the list of observations. This list contains two elements - date and utcdate - that are actually dictionaries.

df = pd.DataFrame.from_dict(wu_data["history"]["observations"])

I would like to index the dataframe I have created with the parsed date from the 'pretty' key within the dictionary. I can access this value by using the array index, but I can't figure out how to do this directly without a loop. For example, for the 23th element I can write

pretty_date  = df["date"].values[23]["pretty"]
print pretty_date
time = parse(pretty_date)
print time

And I get

11:56 PM PDT on April 05, 2006
2006-04-05 23:56:00

This is what I am doing at the moment

g = lambda x: parse(x["pretty"])
df_dates = pd.DataFrame.from_dict(df["date"])
df.index = df_date["date"].apply(g)

df is now reindexed. At this point I can remove the columns I do not need.

Is there a more direct way to do this?

Please notice that sometimes there are multiple observations for the same date, but I deal with data cleaning, duplicates, etc. in a different part of the code.

Upvotes: 0

Views: 1146

Answers (1)

Anzel
Anzel

Reputation: 20553

Since the dtype held in pretty is just object, you can simply grab them to a list and get indexed. Not sure if this is what you want:

# by the way, `r.json` should be without ()`
wu_data = r.json
df = pd.DataFrame.from_dict(wu_data["history"]["observations"])

# just index using list comprehension, getting "pretty" inside df["date"] object.
df.index = [parse(df["date"][n]["pretty"]) for n in range(len(df))]

df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-04-05 00:56:00, ..., 2006-04-05 23:56:00]
Length: 24, Freq: None, Timezone: None

Hope this helps.

Upvotes: 1

Related Questions