Reputation: 1336
I've got a weird one today. I am scraping several thousand PDFs with tabula-py, and the same table (from different PDFs) containing wrapped text is sometimes auto-merged along the table's actual row splits, but on other occasions the resulting pandas DataFrame has many NaN rows to account for the wrapped text. The ratio of merged to unmerged is roughly 50:1, so it makes sense to automate the merging process. Here is an example:
Desired DataFrame:
Column1 | Column2 | Column3
A Many Many ... Lots and ... This keeps..
B lots of text.. Many Texts.. Johns and jo..
C ...
D
Scraped (returned) DataFrame:
Column1 | Column2 | Column3
A Many Many Lots This keeps Just
Nan Many Many and lots Keeps Going!
Nan Texts Nan Nan
B lots of Many Texts John and
Nan text here Johnson inc.
C ...
In this case the text should be merged upward, so that "Many Many Many Many Texts" ends up entirely in cell (A, Column1), and so on.
I have solved this problem with the solution below, but it feels very dirty: there are a ton of index resets and sets just to avoid having to manage the columns and to avoid dropping needed values. Is anyone aware of a better solution?
df = df.reset_index()
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()  # carry row labels down into the wrapped rows
df = df.fillna('')
df = df.set_index('Unnamed: 0')
df = df.groupby(level=0)[df.columns].transform(lambda x: ' '.join(x))
df = df.reset_index()
df = df.drop_duplicates(keep='first')
df = df.set_index('Unnamed: 0')
Cheers
Upvotes: 1
Views: 711
Reputation: 25239
Try this:
df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)
Out[1390]:
Column1 Column2 \
Unnamed: 0
A Many Many Many Many Texts Lots and lots
B lots of text Many Texts here
Column3
Unnamed: 0
A This keeps Just Keeps Going!
B John and Johnson inc.
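For reference, here is a self-contained sketch of that one-liner, using a hypothetical reconstruction of a slice of the scraped frame from the question (values abridged):

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of part of the scraped frame:
# NaN index entries mark wrapped-text continuation rows.
df = pd.DataFrame(
    {"Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
     "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"]},
    index=["A", np.nan, np.nan, "B", np.nan],
)

# Forward-fill the index labels, then join each group's strings per column.
merged = df.fillna("").groupby(df.index.to_series().ffill()).agg(" ".join)
print(merged)
```

One small wrinkle: where a cell was NaN, the fillna('') placeholder leaves a trailing space after the join, which a final `merged.apply(lambda s: s.str.strip())` can tidy up if it matters.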
Upvotes: 2
Reputation: 150735
Similar to Ben's idea:
# fill the missing index
df.index = df.index.to_series().ffill()
(df.stack() # stack to kill the other NaN values
    .groupby(level=(0,1)) # group by (index, column)
.apply(' '.join) # join those strings
.unstack(level=1) # unstack to get columns back
)
Output:
Column1 Column2 Column3
A Many Many Many Many Texts Lots and lots This keeps Just Keeps Going!
B lots of text Many Texts here John and Johnson inc.
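As a self-contained check, the stack-based pipeline can be run on a hypothetical abridged version of the question's data like this:

```python
import numpy as np
import pandas as pd

# Hypothetical abridged version of the scraped frame from the question.
df = pd.DataFrame(
    {"Column1": ["Many Many", "Many Many", "Texts"],
     "Column2": ["Lots", "and lots", np.nan]},
    index=["A", np.nan, np.nan],
)

# fill the missing index, then stack/join/unstack
df.index = df.index.to_series().ffill()
out = (df.stack()                # drops the remaining NaN values
         .groupby(level=[0, 1])  # group by (index, column)
         .agg(' '.join)          # join the strings in each group
         .unstack(level=1))      # restore the columns
print(out)
```

Because stack() drops the NaN cells before joining, this variant avoids the trailing spaces that fillna('') would otherwise leave behind.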
Upvotes: 2
Reputation: 29635
I think you can use ffill on the index directly in the groupby, then use agg instead of transform.
import numpy as np
import pandas as pd

# dummy input
df = pd.DataFrame({'a': list('abcdef'), 'b': list('123456')},
                  index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print(df)
a b
A a 1
NaN b 2
NaN c 3
B d 4
C e 5
NaN f 6
# then group by the forward-filled index and agg
new_df = (df.fillna('')
.groupby(pd.Series(df.index).ffill().values)[df.columns]
.agg(lambda x: ' '.join(x)))
print (new_df)
a b
A a b c 1 2 3
B d 4
C e f 5 6
Upvotes: 0