Pandas - finding the row with least value of one of the levels of a multiindex

Question

So, I have a DataFrame with a multiindex which looks like this:

                               info1 info2       info3
abc-8182 2012-05-08 10:00:00       1   6.0     "yeah!"
         2012-05-08 10:01:00       2  25.0        ":("
pli-9230 2012-05-08 11:00:00       1  30.0  "see yah!"    
         2012-05-08 11:15:00       1  30.0  "see yah!"    

...

The index is an id and a datetime representing when that info about that id was recorded. What we needed to do was to find, for each id, the earliest record. We tried a lot of options from the dataframe methods but we ended up doing it by looping through the DataFrame:

df = pandas.read_csv(...)
empty = pandas.DataFrame()
ids = df.index.get_level_values(0)
for id in ids:
    minDate  = df.xs(id).index.min()
    row = df.xs(id).xs(minDate)
    mindf = pandas.DataFrame(row).transpose()
    mindf.index = pandas.MultiIndex.from_tuples([(id, mindate)])
    empty = empty.append(mindf)

print empty.groupby(lambda x : x).first()

Which gives me:

                                               x0  x1        x2
('abc-8182', )  1   6     yeah!
('pli-9230', )  1  30  see yah!

I feel that there must be a simple, "pandas idiomatic", very immediate way to do this without looping though the data frame like this. Is there? :)

Thanks.

bdiamante · Accepted Answer

To get the first item in each group, you can do:

df.reset_index(level=1).groupby(level=0).first()

which will drop the datetime field to a column before the groups are grouped by groupby, therefore it will remain in the dataframe in the result.

If you need to ensure the earliest time is kept, you can sort, before you call first:

df.reset_index(level=1).sort_index(by="datetime").groupby(level=0).first()

Pandas - finding the row with least value of one of the levels of a multiindex

Answers (1)

Related Questions