Reputation: 1013
I am just starting with pandas and may have bitten off more than I can chew. I have a DataFrame with a MultiIndex on its columns. I want to loop over the 1st ('type') level, randomly select one of the values of the 2nd ('lwc') level, and then create a sub-DataFrame from this selection, which I then add to another DataFrame.
The DataFrame is spec_df, with level names:
spec_df.columns.names
FrozenList([u'type', u'lwc', u'rad', u'cl_top', u'wvc', u'aot', u'press', u'sza', u'phi0', u'umu', u'phi'])
The code I have so far:
rand_clds = pd.DataFrame([])
for l1 in spec_df.columns.levels[0]:
    l2l = spec_df[l1].columns.levels[0]
    rand_l2 = np.random.choice(l2l)
    rand_clds[l1, rand_l2] = spec_df.ix[[l1, rand_l2]]
It works well up to the start of the loop, but l2l contains all the values of level 'lwc', not just the subset that belongs to l1 in 'type'.
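The symptom can be reproduced on a small made-up frame. The sketch below assumes only the level names from the real data; the labels 'ice' and 'water' and all numbers are invented for illustration. It suggests that the levels attribute still lists every label of the original index after a selection, whereas get_level_values reports only the labels actually present:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for spec_df; only the level names match the real frame.
columns = pd.MultiIndex.from_tuples(
    [('ice', 0.1, 5), ('ice', 0.3, 5), ('water', 0.2, 5), ('water', 0.4, 5)],
    names=['type', 'lwc', 'rad'])
spec_df = pd.DataFrame(np.zeros((2, 4)), columns=columns)

# Selecting one 'type' drops that level but does not prune the others.
sub = spec_df['ice']

# Every 'lwc' label defined on spec_df, used under 'ice' or not.
all_lwc = list(sub.columns.levels[0])
# Only the 'lwc' labels that occur under 'ice'.
present = list(sub.columns.get_level_values('lwc').unique())

print(all_lwc)
print(present)
```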
Upvotes: 1
Views: 175
Reputation: 879899
Suppose spec_df looks like this:
In [141]: spec_df
Out[141]:
      foo bar foo baz     bar
        A   B   B   C   A   D
        1   2   3   1   2   3
baz C   2   2   9   6   8   5
    D   7   8   0   6   7   8
qux C   3   8   6   9   2   3
    D   1   2   6   2   9   8
    C   5   8   4   8   9   1
Then you can subselect columns by passing a list of tuples to spec_df. For
example, if cols equals
In [140]: cols
Out[140]: [('baz', 'A', '2'), ('foo', 'B', '3'), ('bar', 'D', '3')]
then
In [147]: spec_df[cols]
Out[147]:
      baz foo bar
        A   B   D
        2   3   3
baz C   8   9   5
    D   7   0   8
qux C   2   6   3
    D   9   6   8
    C   9   4   1
That would solve the problem of how to select a sub-DataFrame, if only we could
construct cols. That turns out to be not so hard using plain Python. Simply
collect the columns in a dict which maps each first-level column value to the
list of full column tuples that start with it:
columns = spec_df.columns
seen = dict()
for col in columns:
    seen.setdefault(col[0], []).append(col)
# >>> seen
# {'bar': [('bar', 'B', '2'), ('bar', 'D', '3')],
# 'baz': [('baz', 'C', '1'), ('baz', 'A', '2')],
# 'foo': [('foo', 'A', '1'), ('foo', 'B', '3')]}
Then use random.choice to select one column tuple for each key in seen:
cols = [random.choice(seen[firstcol]) for firstcol in seen]
Putting it all together:
import random
import numpy as np
import pandas as pd
random.seed(1)

# Example frame with three column levels and two row levels.
spec_df = pd.DataFrame(
    np.random.randint(10, size=(5,6)),
    columns=pd.MultiIndex.from_arrays([['foo','bar','foo','baz','baz','bar'],
                                       list('ABBCAD'),
                                       list('123123')]),
    index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
                                     list('CDCDC')]))

# Group the full column tuples by their first-level value.
columns = spec_df.columns
seen = dict()
for col in columns:
    seen.setdefault(col[0], []).append(col)

# Pick one column tuple at random for each first-level value.
cols = [random.choice(seen[firstcol]) for firstcol in seen]
print(spec_df[cols])
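yields something like the following (np.random is not seeded here, and the column order follows the dict's key order, so the values and the order can differ between runs):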
      baz foo bar
        A   B   D
        2   3   3
baz C   8   9   5
    D   7   0   8
qux C   2   6   3
    D   9   6   8
    C   9   4   1
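If the goal is instead to keep every column that shares the randomly chosen second-level value (rather than a single column per first-level value), a pandas-only variant might look like the sketch below. It assumes the column levels are named 'type' and 'lwc' as in the question; the toy data and the helper name pick_random_lwc are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

# Toy frame; the level names follow the question, the values are invented.
columns = pd.MultiIndex.from_tuples(
    [('ice', 0.1, 5), ('ice', 0.1, 10), ('ice', 0.3, 5),
     ('water', 0.2, 5), ('water', 0.4, 5), ('water', 0.4, 10)],
    names=['type', 'lwc', 'rad'])
spec_df = pd.DataFrame(np.arange(18).reshape(3, 6), columns=columns)


def pick_random_lwc(df, rng):
    """Hypothetical helper: for each 'type', keep the columns of one random 'lwc'."""
    types = df.columns.get_level_values('type')
    lwcs = df.columns.get_level_values('lwc')
    keep = np.zeros(len(df.columns), dtype=bool)
    for t in types.unique():
        # Only the 'lwc' values that actually occur under this 'type'.
        candidates = lwcs[types == t].unique().to_numpy()
        chosen = rng.choice(candidates)
        keep |= (types == t) & (lwcs == chosen)
    # Boolean column selection keeps the MultiIndex and its level names.
    return df.loc[:, keep]


rand_clds = pick_random_lwc(spec_df, np.random.default_rng(0))
print(rand_clds)
```

Building a boolean mask over the columns, rather than assigning into an empty DataFrame, means the result keeps the original column labels.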
Upvotes: 2
Reputation: 1013
This does not answer my own question exactly. I leave that to the genius who will, if there is one. Here is something that simply selects the 1st column for every value of the 1st level I loop over. Not ideal, but it does give me something:
rand_clds = pd.DataFrame([])
for l1 in spec_df.columns.levels[0]:
    # icol(0) has been removed from pandas; iloc[:, 0] selects the first column by position
    rand_clds[l1] = spec_df[l1].iloc[:, 0]
The problem is that the rand_clds DataFrame doesn't have the column names etc. of the original DataFrame. I can at least plot something with this.
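One possible way to keep the original column labels, sketched under the assumption that taking the first column of each 'type' is still the goal: select on the full frame with a boolean mask instead of assigning into an empty DataFrame. The toy frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented stand-in for spec_df with the question's level names.
columns = pd.MultiIndex.from_tuples(
    [('ice', 0.1, 5), ('ice', 0.3, 5), ('water', 0.2, 5), ('water', 0.4, 10)],
    names=['type', 'lwc', 'rad'])
spec_df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=columns)

# True for the first column of each 'type', False for the rest.
types = spec_df.columns.get_level_values('type')
first_of_type = ~types.duplicated()

# Boolean selection keeps the full MultiIndex (labels and level names).
rand_clds = spec_df.loc[:, first_of_type]
print(rand_clds)
```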
Upvotes: 0