Reputation: 5958
I have queried a huge table where majority of the rows in each columns are null (None
) for each single id
(there are multiple rows for each id
). I want to squash these so I'll have one row per id
I'll provide a mock example of the input:
id var1 var2
0 1 None None
1 1 None x
2 2 None k
3 3 None None
4 1 None None
5 1 y NOT
6 3 p None
7 2 ss None
8 2 None None
And here would be the mock example of the output:
id var1 var2
0 1 y x
1 2 ss k
2 3 p None
Note how "NOT"
from var2
column did not make it, because it's the second non-null value for id==1
.
The following snippet works if there are only one value for each id:
df = df.set_index('id')
cs = [df[col].dropna() for col in df]
pd.concat(cs, axis=1)
But if there are multiple values (such as NOT
in the example), this throw an error.
Is there an easy way to aggregate the first not null value for each column for each id
? Thanks.
Here's the mock example pandas dataframe, I'll have a lot more columns and rows for the real dataframe:
df = pd.DataFrame([[1, None, None],[1, None, "x"],[2, None, "k"],[3,None, None], [1, None, None], [1, "y", "NOT"], [3, "p", None], [2, "ss", None], [2, None, None]])
Upvotes: 0
Views: 627
Reputation: 558
idColumnName = 'id' # Define id column name
def funcGroup(x): # Define grouping function
columns = [x for x in x.columns if x!=idColumnName] # Get columns that are not id column
dictValues = {} # Init a dictionary
for col_ in columns: # For each column ...
nonNullValues = x.loc[~x[col_].isnull(), col_] # Get non null values
firstNonNullValue = np.nan # Init result value as null
if nonNullValues.shape[0]>0: firstNonNullValue = nonNullValues.iloc[0] # If there are not null values return first
dictValues[col_] = firstNonNullValue # Fill the dictionary
return pd.Series(dictValues, columns) # Return a pandas Serie
df.groupby([idColumnName]).apply(funcGroup).reset_index() # Group by your id column and apply defined function
Upvotes: 1
Reputation: 323316
You can check with first
: first not null value will be returned
df.groupby(0).first().reset_index()
Out[582]:
0 1 2
0 1 None None
1 2 None k
2 3 None None
Upvotes: 0