Pandas: How to iterate over level one and randomly select from level 2 then add subselection to dataframe

Question

I am just starting with pandas and may have bitten off more than I can chew. I have a DataFrame whose columns are a MultiIndex. I want to loop over the first level ('type'), randomly select one of the values from the second level ('lwc') for each, and build a sub-DataFrame from that selection, which I then add to another DataFrame.

The DataFrame is spec_df, with these column level names:

spec_df.columns.names
FrozenList([u'type', u'lwc', u'rad', u'cl_top', u'wvc', u'aot', u'press', u'sza', u'phi0', u'umu', u'phi'])

The code I have so far:

rand_clds = pd.DataFrame([])

for l1 in spec_df.columns.levels[0]:
    l2l = spec_df[l1].columns.levels[0]
    rand_l2 = np.random.choice(l2l)
    rand_clds[l1, rand_l2] = spec_df.ix[[l1, rand_l2]]

This works up to the start of the loop. Inside it, however, l2l contains all the values of level 'lwc', not just the subset that occurs under l1 in 'type'.

unutbu · Accepted Answer

Suppose spec_df looks like this:

In [141]: spec_df
Out[141]: 
      foo bar foo baz    bar
        A   B   B   C  A   D
        1   2   3   1  2   3
baz C   2   2   9   6  8   5
    D   7   8   0   6  7   8
qux C   3   8   6   9  2   3
    D   1   2   6   2  9   8
    C   5   8   4   8  9   1

Then you can select a subset of the columns by indexing spec_df with a list of column tuples. For example, if cols equals

In [140]: cols
Out[140]: [('baz', 'A', '2'), ('foo', 'B', '3'), ('bar', 'D', '3')]

then

In [147]: spec_df[cols]
Out[147]: 
      baz foo bar
        A   B   D
        2   3   3
baz C   8   9   5
    D   7   0   8
qux C   2   6   3
    D   9   6   8
    C   9   4   1

That would solve the problem of how to select a sub-DataFrame, if only we could construct cols. That turns out to be not so hard using plain Python. Simply collect the columns in a dict which maps each first-level column value to the list of full column tuples that start with it:

columns = spec_df.columns
seen = dict()
for col in columns:
    seen.setdefault(col[0], []).append(col)
# >>> seen
# {'bar': [('bar', 'B', '2'), ('bar', 'D', '3')],
#  'baz': [('baz', 'C', '1'), ('baz', 'A', '2')],
#  'foo': [('foo', 'A', '1'), ('foo', 'B', '3')]}

Then use random.choice to select one column tuple for each key in seen:

cols = [random.choice(seen[firstcol]) for firstcol in seen]
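
The two steps above could also be wrapped into one small helper. This is only a sketch: the name pick_one_per_group and its rng parameter are made up for illustration, and it assumes the columns are a MultiIndex, so that iterating over them produces tuples. Passing a seeded random.Random instance makes the selection repeatable.

```python
import random
from collections import defaultdict

import pandas as pd


def pick_one_per_group(columns, rng=random):
    """Return one randomly chosen full column tuple per first-level value.

    Hypothetical helper; `rng` can be the random module or a
    random.Random instance (anything with a .choice method).
    """
    groups = defaultdict(list)
    for col in columns:              # each col is a tuple such as ('foo', 'A', '1')
        groups[col[0]].append(col)
    # One random pick from each group, in order of first appearance.
    return [rng.choice(candidates) for candidates in groups.values()]


columns = pd.MultiIndex.from_arrays([
    ['foo', 'bar', 'foo', 'baz', 'baz', 'bar'],
    list('ABBCAD'),
    list('123123'),
])
cols = pick_one_per_group(columns, rng=random.Random(0))
print(cols)
```

The returned list can be used directly as spec_df[cols], exactly like the hand-built cols above.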

Putting it all together:

import random
import numpy as np
import pandas as pd
random.seed(1)
spec_df = pd.DataFrame(
    np.random.randint(10, size=(5,6)),
    columns=pd.MultiIndex.from_arrays([['foo','bar','foo','baz','baz','bar'],
                                       list('ABBCAD'),
                                       list('123123')]),
    index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
                                     list('CDCDC')]))

columns = spec_df.columns
seen = dict()
for col in columns:
    seen.setdefault(col[0], []).append(col)
cols = [random.choice(seen[firstcol]) for firstcol in seen]
print(spec_df[cols])

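yields output like the following (the exact values and the chosen columns can differ between runs, because only random is seeded here, not np.random, and they also depend on the Python version):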

      baz foo bar
        A   B   D
        2   3   3
baz C   8   9   5
    D   7   0   8
qux C   2   6   3
    D   9   6   8
    C   9   4   1
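
If the goal is closer to the question's wording, that is, pick one random 'lwc' value for each 'type' and keep every column underneath that pair, a variation along the following lines might work. It is a sketch rather than part of the approach above: the toy data is invented and only the level names 'type', 'lwc' and 'rad' are borrowed from the question. It also sidesteps the problem described in the question: the .levels attribute of a sliced MultiIndex can still list values that no longer occur in the slice, so the sketch reads the values actually present with get_level_values instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Toy frame with three column levels; the values are made up.
columns = pd.MultiIndex.from_tuples(
    [('a', 'x', 1), ('a', 'x', 2), ('a', 'y', 1),
     ('b', 'z', 1), ('b', 'z', 2), ('b', 'w', 1)],
    names=['type', 'lwc', 'rad'])
spec_df = pd.DataFrame(rng.integers(10, size=(3, 6)), columns=columns)

type_values = spec_df.columns.get_level_values('type')
lwc_values = spec_df.columns.get_level_values('lwc')

pieces = []
for l1 in type_values.unique():
    # 'lwc' values that really occur under this 'type'. Using
    # spec_df[l1].columns.levels[0] here may return every 'lwc' value.
    present = lwc_values[type_values == l1].unique()
    l2 = present[rng.integers(len(present))]
    # Keep every column under the (l1, l2) pair, with all levels intact.
    mask = (type_values == l1) & (lwc_values == l2)
    pieces.append(spec_df.loc[:, mask])

rand_clds = pd.concat(pieces, axis=1)
print(rand_clds)
```

Collecting the pieces in a list and concatenating once at the end avoids assigning into an empty DataFrame column by column, which is what the loop in the question attempted.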
