How to select rows of a Pandas dataframe that meet a condition and create new dataframes with the results

Question

How can I take a Pandas DataFrame, split it according to a condition, and save the resulting DataFrames as values in a dict object with pre-specified keys?

I have a Pandas dataframe (the hospital.dat_ data from R's networkDynamicData package). I read it in and process it as a Pandas DataFrame like this:

    with open('hospital.dat_', 'r') as f:
            data = f.readlines()

    #Convert to a pandas dataframe for easier manipulation

    split_data = []

    for line in data:
            split_line = line.split('	')
            split_data.append(split_line)
    df = pd.DataFrame(split_data)

    df.columns=['Time', 'ID1', 'ID2', 'Att1', 'Att2'] #assign column names

The DataFrame df looks like this:

print df.head()

  Time   ID1   ID2 Att1   Att2
0  140  1157  1232  MED  ADM

1  160  1157  1191  MED  MED

2  500  1157  1159  MED  MED

3  520  1157  1159  MED  MED

4  560  1159  1191  MED  MED

I would like to split the DataFrame into sections based on Time. As an MWE, suppose I want a dict object with three DataFrame values:

one that contains all rows for entries of Time less than 200;
one that contains all rows for entries of Time greater than 200 but less than 550;
one that contains all rows for entries of Time greater than 550 but less than 600 (all that's left)
the time index of the start to be the key.

That is, I want:

0:

  Time   ID1   ID2 Att1   Att2
0  140  1157  1232  MED  ADM

1  160  1157  1191  MED  MED

200:

  Time   ID1   ID2 Att1   Att2
0  500  1157  1159  MED  MED

1  520  1157  1159  MED  MED

550:

  Time   ID1   ID2 Att1   Att2
0  560  1159  1191  MED  MED

I have a list that contains the split values, with 0, and a max value at the end:

my_list = [0, 200, 550, 600]

As a start, I've tried:

   sorted_df = {i : [df.loc[(df['Time'] > i) & (df['Time'] <= j)] for i, j in enumerate(my_list)]}

but that gives me only one key, because I keep overwriting, and the values are empty lists.

Question: how can I take a Pandas DataFrame, split it according to a condition, and save the resulting DataFrames as values in a dict object with pre-specified keys?

jezrael · Accepted Answer

You can use cut for bins and then convert groupby object to dict:

df = pd.read_table('hospital.dat_')

cat= pd.cut(df['Time'], bins=my_list, labels = my_list[:-1], include_lowest=True)
print (cat)
0      0
1      0
2    200
3    200
4    550
Name: Time, dtype: category
Categories (3, int64): [0 < 200 < 550]

dfs = dict(tuple(df.groupby(cat)))
print (dfs)
{0:    Time   ID1   ID2 Att1 Att2
0   140  1157  1232  MED  ADM
1   160  1157  1191  MED  MED, 200:    Time   ID1   ID2 Att1 Att2
2   500  1157  1159  MED  MED
3   520  1157  1159  MED  MED, 550:    Time   ID1   ID2 Att1 Att2
4   560  1159  1191  MED  MED}

print (dfs[0])
   Time   ID1   ID2 Att1 Att2
0   140  1157  1232  MED  ADM
1   160  1157  1191  MED  MED

print (dfs[200])
   Time   ID1   ID2 Att1 Att2
2   500  1157  1159  MED  MED
3   520  1157  1159  MED  MED

Also is possible create default indices by dict comprehension:

dfs = {k:v.reset_index(drop=True) for k, v in df.groupby(cat)}
print (dfs[0])
   Time   ID1   ID2 Att1 Att2
0   140  1157  1232  MED  ADM
1   160  1157  1191  MED  MED

print (dfs[200])
   Time   ID1   ID2 Att1 Att2
0   500  1157  1159  MED  MED
1   520  1157  1159  MED  MED

If want solution without cut/groupby you can zip all values of list without last with all values without first and use dict comprehension:

zipped = zip(my_list[:-1], my_list[1:])
print (list(zipped))
[(0, 200), (200, 550), (550, 600)]

sorted_df = {i : df.loc[(df['Time'] > i) & (df['Time'] <= j)].reset_index(drop=True) 
                 for i, j in zipped}
print (sorted_df)
{0:    Time   ID1   ID2 Att1 Att2
0   140  1157  1232  MED  ADM
1   160  1157  1191  MED  MED, 200:    Time   ID1   ID2 Att1 Att2
0   500  1157  1159  MED  MED
1   520  1157  1159  MED  MED, 550:    Time   ID1   ID2 Att1 Att2
0   560  1159  1191  MED  MED}

How to select rows of a Pandas dataframe that meet a condition and create new dataframes with the results

Answers (2)

Related Questions