Reputation: 183
There is already an answer that deals with a relatively simple dataframe, given here.
However, the dataframe I have at hand has multiple columns and a large number of rows. It is really three dataframes concatenated along axis=0 (the bottom of one attached to the top of the next), separated from each other by a row of NaN values.
How can I create three dataframes out of this one data by splitting it along the NaN rows?
Upvotes: 8
Views: 7134
Reputation: 2167
Improving on the other answer so that it supports multiple NaN separator rows:
from IPython.display import display
import numpy as np
import pandas as pd

def split_df_if_row_full_nans(df, reset_header=False):
    # grouping
    df = (df
          .assign(_nan_all_cols=df.isnull().all(axis=1))
          .assign(_group_no=lambda df_: df_._nan_all_cols.cumsum())
          .query('_nan_all_cols == False')   # drop the all-NaN separator rows
          .drop(columns=['_nan_all_cols'])   # drop the helper column
          .reset_index(drop=True)
          )
    # splitting: one dataframe per group, keyed by the first cell of the group
    dfs = {df.iloc[rows[0], 0]: (df
                                 .iloc[rows]
                                 .drop(columns=['_group_no'])
                                 )
           for _, rows in df.groupby('_group_no').groups.items()}
    if reset_header:
        # promote the first row of each group to the column header
        for k, v in dfs.items():
            dfs[k] = (v
                      .rename(columns=v.iloc[0])
                      .drop(index=v.index[0])
                      )
            # TODO: this part seems to only work if the length of the df is > 1
            # dfs[k].set_index(dfs[k].columns[0], drop=True, inplace=True)
    return dfs

sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})
for df in split_df_if_row_full_nans(sample_df1, reset_header=True).values():
    display(df)
# 1.0 1.0 1.0
# 1 2 2 2
# 3.0 3.0 3.0
# 3 4 4 4
for df in split_df_if_row_full_nans(sample_df2, reset_header=True).values():
    display(df)
# 1.0 1.0 1.0
# 1 2 2 2
# 2 NaN 3 NaN
# 3 3 NaN 3
# 4 4 4 4
NOTE: This approach uses .isnull().all(axis=1), i.e. it only splits where every value in the row is NaN.
Upvotes: 0
Reputation: 962
Ran into this same question in 2022. Here's what I did to split dataframes on rows with NaNs; the caveat is that this relies on pip install python-rle for run-length encoding:
import rle  # pip install python-rle
import numpy as np
import pandas as pd

def nanchucks(df):
    # It chucks NaNs outta dataframes
    # True if the row contains any NaN
    df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
    values, counts = rle.encode(df_nans)
    df_nans = pd.DataFrame({"values": values, "counts": counts})
    df_nans["cum_counts"] = df_nans["counts"].cumsum()
    df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
    df_nans.loc[0, "start_idx"] = 0
    df_nans["start_idx"] = df_nans["start_idx"].astype(int)  # shift() inserts a NaN, making it a float column
    df_nans["end_idx"] = df_nans["cum_counts"] - 1
    # Only keep the chunks of data w/o NaNs
    df_nans = df_nans[df_nans["values"] == False]
    indices = []
    for idx, row in df_nans.iterrows():
        indices.append((row["start_idx"], row["end_idx"]))
    return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
Examples:
sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})
print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]
print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]
Upvotes: 1
Reputation: 814
As in the answer you linked, you want to create a column that identifies the group number; then you can apply the same solution.
To do so, you have to test whether all the values of a row are NaN. I don't know if there is such a test built into pandas for rows, but pandas does have a test to check whether a Series is full of NaN. So you want to perform that check row-wise (axis=1), so that each row plays the role of the "Series":
df["group_no"] = df.isnull().all(axis=1).cumsum()
At that point you can use the same technique from that answer to split the dataframes.
You might want to do a .dropna() at the end, because the NaN separator rows will still be in your result.
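The full flow described above can be sketched as follows (a minimal illustration, not the linked answer's exact code; the group_no name comes from this answer and the sample frame is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
})

# True exactly on the all-NaN separator rows
mask = df.isnull().all(axis=1)
# The cumulative sum increments at each separator, labelling each chunk
df["group_no"] = mask.cumsum()
# Drop the separator rows, split on the label, discard the helper column
parts = [g.drop(columns="group_no") for _, g in df[~mask].groupby("group_no")]
# parts[0] holds rows 0-1, parts[1] holds rows 3-4
```

Here df[~mask] plays the role of the final .dropna(), removing the separator rows before the split.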
Upvotes: 11