Reputation: 607
I have csv file parsed from pdf file and it is not correctly parsed as the table in pdf file has multiple rows. importing it into pd DataFrame looks like this.
Record Operational Address BIC
2 2007-03-03 Omladinskih Brigada 90V 11070 BEOGRAD SERBIA, AAAARSBG
3 NaN REPUBLIC OF NaN
4 1994-03-07 SAFAT ALI AL SALEM STREET, MUBARAKIA AAACKWKW
5 NaN OPPOSITE PUBLIC LIBRARY 13022 KUWAIT NaN
6 NaN KUWAIT NaN
7 2006-06-03 CHEZ NSM 3, AVENUE HOCHE 75008 PARIS AAADFRP1
8 NaN FRANCE NaN
9 2006-06-03 10 RUE DU COLISEE 75008 PARIS FRANCE AAAGFRP1
10 NaN NaN NaN
11 2014-07-05 152, 6TH OF SEPTEMBER BLVD. BUSINESS AAAJBG21
12 NaN CENTER LEGIS 4000 PLOVDIV BULGARIA NaN
13 1989-03-29 DHABAB STREET HEAD OFFICE BUILDING 11431 AAALSARI
14 NaN RIYADH SAUDI ARABIA NaN
I want to concatenate next rows into current row if next value in Record column is NaN.
I mean I want to get
Record Operational Address BIC
2 2007-03-03 Omladinskih Brigada 90V 11070 BEOGRAD SERBIA, REPUBLIC OF AAAARSBG
4 1994-03-07 SAFAT ALI AL SALEM STREET, MUBARAKIA OPPOSITE PUBLIC LIBRARY 13022 KUWAIT KUWAIT AAACKWKW
7 2006-06-03 CHEZ NSM 3, AVENUE HOCHE 75008 PARIS FRANCE AAADFRP1
9 2006-06-03 10 RUE DU COLISEE 75008 PARIS FRANCE AAAGFRP1
11 2014-07-05 152, 6TH OF SEPTEMBER BLVD. BUSINESS CENTER LEGIS 4000 PLOVDIV BULGARIA AAAJBG21
13 1989-03-29 DHABAB STREET HEAD OFFICE BUILDING 11431 RIYADH SAUDI ARABIA AAALSARI
here is dataframe
import numpy a np
data = {'Record': {2: '2007-03-03',
3: np.NaN,
4: '1994-03-07',
5: np.NaN,
6: np.NaN,
7: '2006-06-03',
8: np.NaN,
9: '2006-06-03',
10: np.NaN,
11: '2014-07-05',
12: np.NaN,
13: '1989-03-29',
14: np.NaN},
'Operational Address': {2: 'Omladinskih Brigada 90V 11070 BEOGRAD SERBIA,',
3: 'REPUBLIC OF',
4: 'SAFAT ALI AL SALEM STREET, MUBARAKIA',
5: 'OPPOSITE PUBLIC LIBRARY 13022 KUWAIT',
6: 'KUWAIT',
7: 'CHEZ NSM 3, AVENUE HOCHE 75008 PARIS',
8: 'FRANCE',
9: '10 RUE DU COLISEE 75008 PARIS FRANCE',
10: np.NaN,
11: '152, 6TH OF SEPTEMBER BLVD. BUSINESS',
12: 'CENTER LEGIS 4000 PLOVDIV BULGARIA',
13: 'DHABAB STREET HEAD OFFICE BUILDING 11431',
14: 'RIYADH SAUDI ARABIA'},
'BIC': {2: 'AAAARSBG',
3: np.NaN,
4: 'AAACKWKW',
5: np.NaN,
6: np.NaN,
7: 'AAADFRP1',
8: np.NaN,
9: 'AAAGFRP1',
10: np.NaN,
11: 'AAAJBG21',
12: np.NaN,
13: 'AAALSARI',
14: np.NaN}}
df = pd.DataFrame(data=data)
Upvotes: 3
Views: 1071
Reputation: 59579
Use cumsum
to form groups and specify a dictionary for aggregation for each column.
agg_d = {'Record': 'first',
'Operational Address': lambda x: ' '.join(x.dropna()),
'BIC': 'first'}
df.groupby(df.Record.notnull().cumsum().rename(None)).agg(agg_d)
Record Operational Address BIC
1 2007-03-03 Omladinskih Brigada 90V 11070 BEOGRAD SERBIA, REPUBLIC OF AAAARSBG
2 1994-03-07 SAFAT ALI AL SALEM STREET, MUBARAKIA OPPOSITE PUBLIC LIBRARY 13022 KUWAIT KUWAIT AAACKWKW
3 2006-06-03 CHEZ NSM 3, AVENUE HOCHE 75008 PARIS FRANCE AAADFRP1
4 2006-06-03 10 RUE DU COLISEE 75008 PARIS FRANCE AAAGFRP1
5 2014-07-05 152, 6TH OF SEPTEMBER BLVD. BUSINESS CENTER LEGIS 4000 PLOVDIV BULGARIA AAAJBG21
6 1989-03-29 DHABAB STREET HEAD OFFICE BUILDING 11431 RIYADH SAUDI ARABIA AAALSARI
Upvotes: 2