Reputation: 275
I have built up a dictionary of dataframes containing similar data imported from many excel sheets. However, the data are a bit messy and for each DF I have a header of data I need to remove (like metadata). The problem is that this header of useless data is not always the same in terms of length so I cannot use always the same number of rows to drop while looping through these DFs.
The only common thing across all the DFs is that between this messy data and the data I need(tabular data) there is an empty excel row. So my idea was to loop through all the DFs in this dictionary and ask starting from the first row to drop all of them until and empty row is met. Once you meet the empty row I still cancel this and then I exist the loop. I hope it is clear. Any help would be more than appreciated. BR Luigi
Upvotes: 0
Views: 749
Reputation: 1776
In pandas, empty values are represented with np.nan
. For a single dataframe, you can use pd.isnull
with all(axis=1)
to find a whole empty row. Then, you can use idxmax
to get the first row where that is true (if you have more than 1 empty row, you'll want the first one, right?), and then loc
with a :
to get the "rest". Like so,
df = pd.DataFrame({
'cola': [100, 99, 98, np.nan, 96, np.nan],
'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})
print(df)
cola colb
0 100.0 1.0
1 99.0 NaN
2 98.0 3.0
3 NaN NaN <- This is the row we want
4 96.0 5.0
5 NaN NaN <- Not thie one
rest = df.iloc[pd.isnull(df).all(axis=1).idxmax() + 1:, :]
print(rest)
cola colb
4 96.0 5.0
5 NaN NaN
In terms of doing it to mutiple dataframes in a dictionary, you can simply iterate over them and repeat the previous method.
# Sample data
df1 = pd.DataFrame({
'cola': [100, 99, 98, np.nan, 96, np.nan],
'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})
df2 = pd.DataFrame({
'cola': [100, 99, 98, np.nan, 96, np.nan],
'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})
dct = {'first': df1, 'second': df2}
# Solution
out_dict = {}
for key, frame in dct.items():
new_frame = frame.iloc[pd.isnull(frame).all(axis=1).idxmax() + 1:, :].reset_index(drop=True)
out_dict[key] = new_frame
out_dict
now contains your desired dataframes.
Upvotes: 1