Rocky Li
Rocky Li

Reputation: 5958

pandas gather first not null values from different columns on the same index

I have queried a huge table where majority of the rows in each columns are null (None) for each single id (there are multiple rows for each id). I want to squash these so I'll have one row per id

I'll provide a mock example of the input:

    id  var1  var2
0   1  None  None
1   1  None     x
2   2  None     k
3   3  None  None
4   1  None  None
5   1     y   NOT
6   3     p  None
7   2    ss  None
8   2  None  None

And here would be the mock example of the output:

    id var1  var2
0   1    y     x
1   2   ss     k
2   3    p  None

Note how "NOT" from var2 column did not make it, because it's the second non-null value for id==1.

The following snippet works if there are only one value for each id:

df = df.set_index('id')
cs = [df[col].dropna() for col in df]
pd.concat(cs, axis=1)

But if there are multiple values (such as NOT in the example), this throw an error.

Is there an easy way to aggregate the first not null value for each column for each id? Thanks.

Here's the mock example pandas dataframe, I'll have a lot more columns and rows for the real dataframe:

df = pd.DataFrame([[1, None, None],[1, None, "x"],[2, None, "k"],[3,None, None], [1, None, None], [1, "y", "NOT"], [3, "p", None], [2, "ss", None], [2, None, None]])

Upvotes: 0

Views: 627

Answers (2)

gilgorio
gilgorio

Reputation: 558

idColumnName = 'id' # Define id column name
def funcGroup(x): # Define grouping function
    columns = [x for x in x.columns if x!=idColumnName] # Get columns that are not id column
    dictValues = {} # Init a dictionary
    for col_ in columns: # For each column ...
        nonNullValues = x.loc[~x[col_].isnull(), col_] # Get non null values
        firstNonNullValue = np.nan # Init result value as null
        if nonNullValues.shape[0]>0: firstNonNullValue = nonNullValues.iloc[0] # If there are not null values return first
        dictValues[col_] = firstNonNullValue # Fill the dictionary
    return pd.Series(dictValues, columns) # Return a pandas Serie

df.groupby([idColumnName]).apply(funcGroup).reset_index() # Group by your id column and apply defined function

Upvotes: 1

BENY
BENY

Reputation: 323316

You can check with first : first not null value will be returned

df.groupby(0).first().reset_index()
Out[582]: 
   0     1     2
0  1  None  None
1  2  None     k
2  3  None  None

Upvotes: 0

Related Questions