Peter

Reputation: 1128

Efficiently transforming pandas dataframe column names into row values

I'm working with log data of a site key and then key/value pairs of event and timestamp. A pre-process delivers it as a pandas dataframe with an indeterminate number of columns (could be 1-N), in which the column name is the key and the value is the timestamp. I know this is a less-than-ideal structure but I can't change it. What's an efficient way to operate on the dataframe such that, for every key, I can count the non-null columns and identify the non-null values?

Here's hypothetical raw data:

KeyCol  ValCol1 ValCol2 ValCol3
A   ValCol1 ValCol2 nan
B   nan nan nan
C   ValCol1 nan ValCol3
D   nan ValCol2 nan
C   nan nan ValCol3

And here's the desired output:

KeyCol  Len Vals
A   2   [ValCol1, ValCol2]
B   0   []
C   2   [ValCol1, ValCol3]
D   1   [ValCol2]
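For anyone who wants to reproduce this without downloading the gist, the hypothetical data above can be built directly (column and value names simply mirror the sample):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data from above; nan marks a missing timestamp
df = pd.DataFrame({
    "KeyCol":  ["A", "B", "C", "D", "C"],
    "ValCol1": ["ValCol1", np.nan, "ValCol1", np.nan, np.nan],
    "ValCol2": ["ValCol2", np.nan, np.nan, "ValCol2", np.nan],
    "ValCol3": [np.nan, np.nan, "ValCol3", np.nan, "ValCol3"],
})

# Quick sanity check: non-null value columns per row
print(df.drop(columns="KeyCol").notna().sum(axis=1).tolist())  # [2, 0, 2, 1, 1]
```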

Here's what I've done so far... but it seems inefficient. How would I optimize it?

import pandas as pd

def iterate_rows_for_nonnulls(df, grouper):
    assert isinstance(df, pd.DataFrame)
    assert isinstance(grouper, (list, tuple))

    aggdic = {}
    for row, data in df.iterrows():
        key = tuple(data[grouper])
        nonnulls = data[~data.index.isin(grouper)]
        nonnulls = nonnulls[nonnulls.notnull()]
        if key not in aggdic: 
            aggdic[key] = {}
            aggdic[key]['vals'] = nonnulls
        else:
            aggdic[key]['vals'] = pd.concat([aggdic[key]['vals'], nonnulls])

    for key in aggdic:
        aggdic[key]['vals'] = aggdic[key]['vals'].unique()
        aggdic[key]['len'] = len(aggdic[key]['vals'])
    # Testing using [key for key in aggdic if aggdic[key]['len'] > 0]
    return aggdic

def construct_df_from_nonnulls(aggdic):
    assert isinstance(aggdic, dict)
    return pd.DataFrame(dict(aggdic)).T

sourcedf = pd.read_table('https://gist.githubusercontent.com/roablep/a11da82de18b14bd2c3c/raw/257f2fa7634002db267e2ef599d6e0cd330c1c72/Sourcedata', sep="\t")
aggdic = iterate_rows_for_nonnulls(sourcedf, ['KeyCol'])
resultsdf = construct_df_from_nonnulls(aggdic)

Upvotes: 1

Views: 367

Answers (1)

DSM

Reputation: 353099

Assuming I'm understanding you correctly -- it's weird to have column names being the same as the values of all the elements in those columns -- I think you can simply throw groupby at it after melting:

>>> d2 = pd.melt(df, id_vars="KeyCol")
>>> grouped = d2.groupby("KeyCol", as_index=False)["value"]
>>> dout = grouped.agg(Len="nunique",
...                    Vals=lambda x: x.dropna().unique().tolist())
>>> dout
  KeyCol  Len                Vals
0      A    2  [ValCol1, ValCol2]
1      B    0                  []
2      C    2  [ValCol1, ValCol3]
3      D    1           [ValCol2]

Reading the doc section on the split-apply-combine pattern is highly recommended.

That said, note that putting non-scalar objects like lists in a Series or DataFrame as elements is a recipe for headaches -- they're not really supported. It can sometimes be useful as an intermediate step, but you can't do much with such a Series after making it.
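If the lists are only needed for display, one way to sidestep that caveat (a sketch of an alternative, not required by the approach above) is to keep every cell scalar by joining the unique names into a single string:

```python
import numpy as np
import pandas as pd

# Same hypothetical sample data as in the question
df = pd.DataFrame({
    "KeyCol":  ["A", "B", "C", "D", "C"],
    "ValCol1": ["ValCol1", np.nan, "ValCol1", np.nan, np.nan],
    "ValCol2": ["ValCol2", np.nan, np.nan, "ValCol2", np.nan],
    "ValCol3": [np.nan, np.nan, "ValCol3", np.nan, "ValCol3"],
})

# Melt to long form, then aggregate per key; each cell stays a plain
# int or string, so the result remains easy to filter, sort, and export
d2 = pd.melt(df, id_vars="KeyCol")
grouped = d2.groupby("KeyCol")["value"]
out = pd.DataFrame({
    "Len": grouped.nunique(),
    "Vals": grouped.apply(lambda x: ", ".join(sorted(x.dropna().unique()))),
}).reset_index()
print(out)
```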

Upvotes: 2
