Luigi87
Luigi87

Reputation: 275

How to delete rows from a DF until an empty row is found (Python)

I have built up a dictionary of dataframes containing similar data imported from many excel sheets. However, the data are a bit messy and for each DF I have a header of data I need to remove (like metadata). The problem is that this header of useless data is not always the same in terms of length so I cannot use always the same number of rows to drop while looping through these DFs.

The only common thing across all the DFs is that between this messy data and the data I need(tabular data) there is an empty excel row. So my idea was to loop through all the DFs in this dictionary and ask starting from the first row to drop all of them until and empty row is met. Once you meet the empty row I still cancel this and then I exist the loop. I hope it is clear. Any help would be more than appreciated. BR Luigi

Upvotes: 0

Views: 749

Answers (1)

Bertil Johannes Ipsen
Bertil Johannes Ipsen

Reputation: 1776

In pandas, empty values are represented with np.nan. For a single dataframe, you can use pd.isnull with all(axis=1) to find a whole empty row. Then, you can use idxmax to get the first row where that is true (if you have more than 1 empty row, you'll want the first one, right?), and then loc with a : to get the "rest". Like so,

df = pd.DataFrame({
    'cola': [100, 99, 98, np.nan, 96, np.nan],
    'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})

print(df)

    cola  colb
0  100.0   1.0
1   99.0   NaN
2   98.0   3.0
3    NaN   NaN <- This is the row we want
4   96.0   5.0
5    NaN   NaN <- Not thie one

rest = df.iloc[pd.isnull(df).all(axis=1).idxmax() + 1:, :]
print(rest)

   cola  colb
4  96.0   5.0
5   NaN   NaN

In terms of doing it to mutiple dataframes in a dictionary, you can simply iterate over them and repeat the previous method.

# Sample data
df1 = pd.DataFrame({
    'cola': [100, 99, 98, np.nan, 96, np.nan],
    'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})
df2 = pd.DataFrame({
    'cola': [100, 99, 98, np.nan, 96, np.nan],
    'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})
dct = {'first': df1, 'second': df2}

# Solution
out_dict = {}
for key, frame in dct.items():
    new_frame = frame.iloc[pd.isnull(frame).all(axis=1).idxmax() + 1:, :].reset_index(drop=True)
    out_dict[key] = new_frame

out_dict now contains your desired dataframes.

Upvotes: 1

Related Questions