Reputation: 1237
I have a pandas dataframe built from text extracted from a PDF file. It looks like this:
index  date        description1  description2  value1  value2
0      18-01-2019  some          more          1       2
1      NaN         text          text          NaN     NaN
2      NaN         here          NaN           NaN     NaN
3      19-01-2019  some          some          3       4
4      NaN         text          more          NaN     NaN
5      NaN         here          text          NaN     NaN
6      NaN         NaN           here          NaN     NaN
...
There is always at least 1 row with no NaN, and that row will always contain the Date and the Values. Only the Descriptions are on multiple rows.
Is there a way to join each row that has a date with the rows below it, until the next non-NaN value, concatenating the descriptions along the way?
Expected output:
index  date        description1    description2         value1  value2
0      18-01-2019  some text here  more text            1       2
1      19-01-2019  some text here  some more text here  3       4
...
Upvotes: 2
Views: 388
Reputation: 1065
Use ffill to fill the missing dates, then groupby that column and join the descriptions inside agg:
df['date'] = df['date'].ffill()
df_new = df.groupby('date').agg({'description1': lambda x: ' '.join(x.dropna())})
Update: for your expected output format, you probably need to adjust the index a little, like this:
df_new = df.groupby('date', as_index=False).agg({'description1': lambda x: ' '.join(x.dropna())})
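A minimal runnable sketch of this approach, with the sample frame reconstructed from the question and the value columns kept as well (the dict of aggregations per column is an assumption about your full frame):

```python
import pandas as pd

# Sample data modeled on the question's frame
df = pd.DataFrame({
    'date': ['18-01-2019', None, None, '19-01-2019', None, None, None],
    'description1': ['some', 'text', 'here', 'some', 'text', 'here', None],
    'description2': ['more', 'text', None, 'some', 'more', 'text', 'here'],
    'value1': [1, None, None, 3, None, None, None],
    'value2': [2, None, None, 4, None, None, None],
})

# Forward-fill the date so every row belongs to a group, then aggregate:
# join the non-missing descriptions, keep the first value per group
df['date'] = df['date'].ffill()
df_new = df.groupby('date', as_index=False).agg({
    'description1': lambda x: ' '.join(x.dropna()),
    'description2': lambda x: ' '.join(x.dropna()),
    'value1': 'first',
    'value2': 'first',
})
print(df_new)
```

Note the `x.dropna()` inside the join: without it, joining a group that contains NaN raises a TypeError.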
Upvotes: 1
Reputation: 863056
One idea is to create a grouping key by forward filling date (or any other column that distinguishes the groups), and then aggregate: if a column is numeric, take the first value, otherwise join it after removing missing values:
import numpy as np

f = lambda x: x.iloc[0] if np.issubdtype(x.dtype, np.number) else ' '.join(x.dropna())
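As a runnable sketch (the sample frame is reconstructed from the question), this single function can be passed straight to agg, since it dispatches on each column's dtype:

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question's frame
df = pd.DataFrame({
    'date': ['18-01-2019', None, None, '19-01-2019', None, None, None],
    'description1': ['some', 'text', 'here', 'some', 'text', 'here', None],
    'description2': ['more', 'text', None, 'some', 'more', 'text', 'here'],
    'value1': [1, None, None, 3, None, None, None],
    'value2': [2, None, None, 4, None, None, None],
})

# First value for numeric columns, joined non-missing strings otherwise
f = lambda x: x.iloc[0] if np.issubdtype(x.dtype, np.number) else ' '.join(x.dropna())

# Group by the forward-filled date Series without mutating df
df = df.groupby(df['date'].ffill()).agg(f).reset_index(drop=True)
print(df)
```

Grouping by the Series `df['date'].ffill()` (rather than a column name) keeps the original `date` column in the frame, where the same join-after-dropna logic collapses it back to a single date per group.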
Or specify each column in a dictionary:
f1 = lambda x: ' '.join(x.dropna())
f = {'date':'first', 'description1':f1, 'description2':f1, 'value1':'first', 'value2':'first'}
If it should be created dynamically, build both dicts and merge them together:
f1 = lambda x: ' '.join(x.dropna())
c =['description1','description2']
d1 = dict.fromkeys(c, f1)
d2 = dict.fromkeys(df.columns.difference(c), 'first')
f = {**d1, **d2}
df = df.groupby(df['date'].ffill()).agg(f).reset_index(drop=True)
#alternative
#df = df.groupby(df['date'].ffill(), as_index=False).agg(f)
print(df)
         date    description1         description2  value1  value2
0  18-01-2019  some text here            more text     1.0     2.0
1  19-01-2019  some text here  some more text here     3.0     4.0
Upvotes: 3