Reputation: 1128
I'm working with log data consisting of a site key and then key/value pairs of event and timestamp. A pre-process delivers it as a pandas dataframe with an indeterminate number of columns (could be 1-N), in which the column is the event key and the value is the timestamp. I know this is a less-than-ideal structure but I can't change it. What's an efficient way to operate on the dataframe such that, for every key, I can count the non-null columns and identify the non-null values?
Here's hypothetical raw data:
KeyCol  ValCol1  ValCol2  ValCol3
A       ValCol1  ValCol2  nan
B       nan      nan      nan
C       ValCol1  nan      ValCol3
D       nan      ValCol2  nan
C       nan      nan      ValCol3
And here's the desired output:
KeyCol  Len  Vals
A       2    [ValCol1, ValCol2]
B       0    []
C       2    [ValCol1, ValCol3]
D       1    [ValCol2]
Here's what I've done so far... but it seems inefficient. How would I optimize it?
import pandas as pd

def iterate_rows_for_nonnulls(df, grouper):
    assert isinstance(df, pd.DataFrame)
    assert isinstance(grouper, (list, tuple))
    aggdic = {}
    for row, data in df.iterrows():
        key = tuple(data[grouper])
        # Keep only the non-grouper columns that have a value in this row
        nonnulls = data[~data.index.isin(grouper)]
        nonnulls = nonnulls[nonnulls.notnull()]
        if key not in aggdic:
            aggdic[key] = {}
            aggdic[key]['vals'] = nonnulls
        else:
            aggdic[key]['vals'] = pd.concat([aggdic[key]['vals'], nonnulls])
    for key in aggdic:
        aggdic[key]['vals'] = aggdic[key]['vals'].unique()
        aggdic[key]['len'] = len(aggdic[key]['vals'])
    # Sanity check: [key for key in aggdic if aggdic[key]['len'] > 0]
    return aggdic

def construct_df_from_nonnulls(aggdic):
    assert isinstance(aggdic, dict)
    return pd.DataFrame(dict(aggdic)).T
sourcedf = pd.read_table('https://gist.githubusercontent.com/roablep/a11da82de18b14bd2c3c/raw/257f2fa7634002db267e2ef599d6e0cd330c1c72/Sourcedata', sep = "\t")
aggdic = iterate_rows_for_nonnulls(sourcedf, ['KeyCol'])
resultsdf = construct_df_from_nonnulls(aggdic)
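In case the gist is unavailable, the same pipeline can be exercised on an inline frame built from the hypothetical table above (an illustrative stand-in, not the real log data):

import numpy as np
import pandas as pd

# Stand-in for the gist: mirrors the hypothetical raw data above
sourcedf = pd.DataFrame({
    "KeyCol":  ["A", "B", "C", "D", "C"],
    "ValCol1": ["ValCol1", np.nan, "ValCol1", np.nan, np.nan],
    "ValCol2": ["ValCol2", np.nan, np.nan, "ValCol2", np.nan],
    "ValCol3": [np.nan, np.nan, "ValCol3", np.nan, "ValCol3"],
})

aggdic = iterate_rows_for_nonnulls(sourcedf, ["KeyCol"])
resultsdf = construct_df_from_nonnulls(aggdic)
print(resultsdf)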
Upvotes: 1
Views: 367
Reputation: 353099
Assuming I'm understanding you correctly -- it's weird to have the column names be the same as the values of all the elements in those columns -- I think you can simply throw groupby at it after melting:
>>> d2 = pd.melt(df, id_vars="KeyCol")
>>> grouped = d2.groupby("KeyCol", as_index=False)["value"]
>>> dout = grouped.agg({"Len": "nunique",
...                     "Vals": lambda x: x.dropna().unique().tolist()})
>>> dout
  KeyCol                Vals  Len
0      A  [ValCol1, ValCol2]    2
1      B                  []    0
2      C  [ValCol1, ValCol3]    2
3      D           [ValCol2]    1
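A side note for anyone reading this on a newer pandas: the dict-renaming form of .agg on a SeriesGroupBy shown above was later deprecated and removed. A sketch of the same aggregation using named aggregation (pandas 0.25+) would look like this -- behaviour should match, though I haven't verified it on every version:

>>> dout = grouped.agg(Len="nunique",
...                    Vals=lambda x: x.dropna().unique().tolist())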
Reading the doc section on the split-apply-combine pattern is highly recommended.
That said, note that putting non-scalar objects like lists in Series or DataFrames as elements is a recipe for headaches -- they're not really supported. It can sometimes be useful as an intermediate step, but you can't really do much with that Series after making it.
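For example (my own illustration, not from the answer above): even a simple question like "which keys saw ValCol1?" has to go through .apply, because the list elements are opaque to pandas' vectorized operations:

>>> dout[dout["Vals"].apply(lambda v: "ValCol1" in v)]
  KeyCol                Vals  Len
0      A  [ValCol1, ValCol2]    2
2      C  [ValCol1, ValCol3]    2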
Upvotes: 2