Reputation: 1128
I'm working with log data consisting of a site key and then key/value pairs of event and timestamp. A pre-process delivers it as a pandas dataframe with an indeterminate number of columns (could be 1-N), in which the column is the event key and the value is the timestamp. I know this is a less-than-ideal structure but I can't change it. What's an efficient way to operate on the dataframe such that, for every key, I can count the non-null columns and identify the non-null values?
Here's hypothetical raw data:
KeyCol  ValCol1  ValCol2  ValCol3
A       ValCol1  ValCol2  nan
B       nan      nan      nan
C       ValCol1  nan      ValCol3
D       nan      ValCol2  nan
C       nan      nan      ValCol3
And here's the desired output:
KeyCol  Len  Vals
A       2    [ValCol1, ValCol2]
B       0    []
C       2    [ValCol1, ValCol3]
D       1    [ValCol2]
Here's what I've done so far... but it seems inefficient. How would I optimize it?
import pandas as pd

def iterate_rows_for_nonnulls(df, grouper):
    assert isinstance(df, pd.DataFrame)
    assert isinstance(grouper, (list, tuple))
    aggdic = {}
    for row, data in df.iterrows():
        key = tuple(data[grouper])
        # Keep only the non-grouper columns that have a value in this row
        nonnulls = data[~data.index.isin(grouper)]
        nonnulls = nonnulls[nonnulls.notnull()]
        if key not in aggdic:
            aggdic[key] = {}
            aggdic[key]['vals'] = nonnulls
        else:
            aggdic[key]['vals'] = pd.concat([aggdic[key]['vals'], nonnulls])
    for key in aggdic:
        aggdic[key]['vals'] = aggdic[key]['vals'].unique()
        aggdic[key]['len'] = len(aggdic[key]['vals'])
    # Sanity check: [key for key in aggdic if aggdic[key]['len'] > 0]
    return aggdic

def construct_df_from_nonnulls(aggdic):
    assert isinstance(aggdic, dict)
    return pd.DataFrame(dict(aggdic)).T
sourcedf = pd.read_table('https://gist.githubusercontent.com/roablep/a11da82de18b14bd2c3c/raw/257f2fa7634002db267e2ef599d6e0cd330c1c72/Sourcedata', sep = "\t")
aggdic = iterate_rows_for_nonnulls(sourcedf, ['KeyCol'])
resultsdf = construct_df_from_nonnulls(aggdic)
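In case the gist is unavailable, the same pipeline can be exercised on an inline frame built from the hypothetical table above (an illustrative stand-in, not the real log data):

import numpy as np
import pandas as pd

# Stand-in for the gist: mirrors the hypothetical raw data above
sourcedf = pd.DataFrame({
    "KeyCol":  ["A", "B", "C", "D", "C"],
    "ValCol1": ["ValCol1", np.nan, "ValCol1", np.nan, np.nan],
    "ValCol2": ["ValCol2", np.nan, np.nan, "ValCol2", np.nan],
    "ValCol3": [np.nan, np.nan, "ValCol3", np.nan, "ValCol3"],
})

aggdic = iterate_rows_for_nonnulls(sourcedf, ["KeyCol"])
resultsdf = construct_df_from_nonnulls(aggdic)
print(resultsdf)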
Upvotes: 1
Views: 367
Reputation: 353099
Assuming I'm understanding you correctly -- it's weird to have the column names be the same as the values of all the elements in those columns -- I think you can simply throw groupby at it after melting:
>>> d2 = pd.melt(df, id_vars="KeyCol")
>>> grouped = d2.groupby("KeyCol", as_index=False)["value"]
>>> dout = grouped.agg({"Len": "nunique",
...                     "Vals": lambda x: x.dropna().unique().tolist()})
>>> dout
  KeyCol                Vals  Len
0      A  [ValCol1, ValCol2]    2
1      B                  []    0
2      C  [ValCol1, ValCol3]    2
3      D           [ValCol2]    1
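A side note for anyone reading this on a newer pandas: the dict-renaming form of .agg on a SeriesGroupBy shown above was later deprecated and removed. A sketch of the same aggregation using named aggregation (pandas 0.25+) would look like this -- behaviour should match, though I haven't verified it on every version:

>>> dout = grouped.agg(Len="nunique",
...                    Vals=lambda x: x.dropna().unique().tolist())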
Reading the doc section on the split-apply-combine pattern is highly recommended.
That said, note that putting non-scalar objects like lists in Series or DataFrames as elements is a recipe for headaches -- they're not really supported. It can sometimes be useful as an intermediate step, but you can't really do much with that Series after making it.
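For example (my own illustration, not from the answer above): even a simple question like "which keys saw ValCol1?" has to go through .apply, because the list elements are opaque to pandas' vectorized operations:

>>> dout[dout["Vals"].apply(lambda v: "ValCol1" in v)]
  KeyCol                Vals  Len
0      A  [ValCol1, ValCol2]    2
2      C  [ValCol1, ValCol3]    2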
Upvotes: 2