Reputation: 1237
I have a pandas dataframe built from text extracted from a PDF file. It looks like this:
index  date        description1  description2  value1  value2
0      18-01-2019  some          more          1       2
1      NaN         text          text          NaN     NaN
2      NaN         here          NaN           NaN     NaN
3      19-01-2019  some          some          3       4
4      NaN         text          more          NaN     NaN
5      NaN         here          text          NaN     NaN
6      NaN         NaN           here          NaN     NaN
...
There is always at least 1 row with no NaN, and that row will always contain the Date and the Values. Only the Descriptions are on multiple rows.
Is there a way to join each row that has a date with the rows below it, until the next non-NaN value, concatenating the descriptions along the way?
Expected output:
index  date        description1    description2         value1  value2
0      18-01-2019  some text here  more text            1       2
1      19-01-2019  some text here  some more text here  3       4
...
Upvotes: 2
Views: 388
Reputation: 1065
Use ffill to fill the missing dates, then groupby that column and join the descriptions inside agg:
df['date'] = df['date'].ffill()
df_new = df.groupby('date').agg({'description1': lambda x: ' '.join(x.dropna())})
Update: for your expected output format, you probably need to adjust the index a little, like this:
df_new = df.groupby('date', as_index=False).agg({'description1': lambda x: ' '.join(x.dropna())})
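A minimal runnable sketch of this approach, with the sample frame reconstructed from the question and the value columns kept as well (the dict of aggregations per column is an assumption about your full frame):

```python
import pandas as pd

# Sample data modeled on the question's frame
df = pd.DataFrame({
    'date': ['18-01-2019', None, None, '19-01-2019', None, None, None],
    'description1': ['some', 'text', 'here', 'some', 'text', 'here', None],
    'description2': ['more', 'text', None, 'some', 'more', 'text', 'here'],
    'value1': [1, None, None, 3, None, None, None],
    'value2': [2, None, None, 4, None, None, None],
})

# Forward-fill the date so every row belongs to a group, then aggregate:
# join the non-missing descriptions, keep the first value per group
df['date'] = df['date'].ffill()
df_new = df.groupby('date', as_index=False).agg({
    'description1': lambda x: ' '.join(x.dropna()),
    'description2': lambda x: ' '.join(x.dropna()),
    'value1': 'first',
    'value2': 'first',
})
print(df_new)
```

Note the `x.dropna()` inside the join: without it, joining a group that contains NaN raises a TypeError.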
Upvotes: 1
Reputation: 863056
One idea is to create a grouping key by forward filling date (or any other column that distinguishes the groups), and then aggregate: if a column is numeric, take the first value, otherwise join it after removing missing values:
import numpy as np

f = lambda x: x.iloc[0] if np.issubdtype(x.dtype, np.number) else ' '.join(x.dropna())
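As a runnable sketch (the sample frame is reconstructed from the question), this single function can be passed straight to agg, since it dispatches on each column's dtype:

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question's frame
df = pd.DataFrame({
    'date': ['18-01-2019', None, None, '19-01-2019', None, None, None],
    'description1': ['some', 'text', 'here', 'some', 'text', 'here', None],
    'description2': ['more', 'text', None, 'some', 'more', 'text', 'here'],
    'value1': [1, None, None, 3, None, None, None],
    'value2': [2, None, None, 4, None, None, None],
})

# First value for numeric columns, joined non-missing strings otherwise
f = lambda x: x.iloc[0] if np.issubdtype(x.dtype, np.number) else ' '.join(x.dropna())

# Group by the forward-filled date Series without mutating df
df = df.groupby(df['date'].ffill()).agg(f).reset_index(drop=True)
print(df)
```

Grouping by the Series `df['date'].ffill()` (rather than a column name) keeps the original `date` column in the frame, where the same join-after-dropna logic collapses it back to a single date per group.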
Or specify each column in a dictionary:
f1 = lambda x: ' '.join(x.dropna())
f = {'date':'first', 'description1':f1, 'description2':f1, 'value1':'first', 'value2':'first'}
If it should be created dynamically, build both dicts and merge them together:
f1 = lambda x: ' '.join(x.dropna())
c =['description1','description2']
d1 = dict.fromkeys(c, f1)
d2 = dict.fromkeys(df.columns.difference(c), 'first')
f = {**d1, **d2}
df = df.groupby(df['date'].ffill()).agg(f).reset_index(drop=True)
#alternative
#df = df.groupby(df['date'].ffill(), as_index=False).agg(f)
print(df)
         date    description1         description2  value1  value2
0  18-01-2019  some text here            more text     1.0     2.0
1  19-01-2019  some text here  some more text here     3.0     4.0
Upvotes: 3