Reputation: 377
Using tabula I have a dataframe of transactions from a pdf, which has the following layout:
| | Details | Transaction |
--- ------------- -------------
| 0 | Date: 11:20 | Debit |
| 1 | 01/01/2020 | $100.50 |
| 2 | Date: 13:15 | Credit |
| 3 | 01/02/2020 | $50.00 |
--- ------------- -------------
I would like merge rows that are from the same transaction. In this case rows [0, 1] and [2, 3].
My idea was to find each row number with the string 'Date' and merge those rows.
index = list(df.index[df['Details'].str.contains('Date')]
>>> [0, 2]
I would expect the output to be:
| | Details | Transaction |
--- ------------- -------------
| 0 | Date: 11:20 | Debit |
| | 01/01/2020 | $100.50 |
| 1 | Date: 13:15 | Credit |
| | 01/02/2020 | $50.00 |
--- ------------- -------------
Upvotes: 2
Views: 927
Reputation: 71687
Use Series.str.contains
+ Series.cumsum
to create a grouper g
, then use DataFrame.groupby
to group the dataframe on g
and agg
using join
:
g = df['Details'].str.contains('(?i)Date:').cumsum()
df = df.groupby(g, as_index=False).agg(' '.join)
Another idea (If the dataframe structure is fixed):
df = df.groupby(df.index // 2).agg(' '.join)
Result:
Details Transaction
0 Date: 11:20 01/01/2020 Debit $100.50
1 Date: 13:15 01/02/2020 Credit $50.00
Upvotes: 3