StatsSorceress
StatsSorceress

Reputation: 3099

How to select rows of a Pandas dataframe that meet a condition and create new dataframes with the results

How can I take a Pandas DataFrame, split it according to a condition, and save the resulting DataFrames as values in a dict object with pre-specified keys?

I have a Pandas dataframe (the hospital.dat_ data from R's networkDynamicData package). I read it in and process it as a Pandas DataFrame like this:

    with open('hospital.dat_', 'r') as f:
            data = f.readlines()

    #Convert to a pandas dataframe for easier manipulation

    split_data = []

    for line in data:
            split_line = line.split('\t')
            split_data.append(split_line)
    df = pd.DataFrame(split_data)

    df.columns=['Time', 'ID1', 'ID2', 'Att1', 'Att2'] #assign column names

The DataFrame df looks like this:

print df.head()

  Time   ID1   ID2 Att1   Att2
0  140  1157  1232  MED  ADM\n
1  160  1157  1191  MED  MED\n
2  500  1157  1159  MED  MED\n
3  520  1157  1159  MED  MED\n
4  560  1159  1191  MED  MED\n

I would like to split the DataFrame into sections based on Time. As an MWE, suppose I want a dict object with three DataFrame values:

That is, I want:

0:

  Time   ID1   ID2 Att1   Att2
0  140  1157  1232  MED  ADM\n
1  160  1157  1191  MED  MED\n

200:

  Time   ID1   ID2 Att1   Att2
0  500  1157  1159  MED  MED\n
1  520  1157  1159  MED  MED\n

550:

  Time   ID1   ID2 Att1   Att2
0  560  1159  1191  MED  MED\n

I have a list that contains the split values, with 0, and a max value at the end:

my_list = [0, 200, 550, 600]

As a start, I've tried:

   sorted_df = {i : [df.loc[(df['Time'] > i) & (df['Time'] <= j)] for i, j in enumerate(my_list)]}

but that gives me only one key, because I keep overwriting, and the values are empty lists.

Question: how can I take a Pandas DataFrame, split it according to a condition, and save the resulting DataFrames as values in a dict object with pre-specified keys?

Upvotes: 2

Views: 715

Answers (2)

jezrael
jezrael

Reputation: 862671

You can use cut for bins and then convert groupby object to dict:

df = pd.read_table('hospital.dat_')

cat= pd.cut(df['Time'], bins=my_list, labels = my_list[:-1], include_lowest=True)
print (cat)
0      0
1      0
2    200
3    200
4    550
Name: Time, dtype: category
Categories (3, int64): [0 < 200 < 550]

dfs = dict(tuple(df.groupby(cat)))
print (dfs)
{0:    Time   ID1   ID2 Att1 Att2
0   140  1157  1232  MED  ADM
1   160  1157  1191  MED  MED, 200:    Time   ID1   ID2 Att1 Att2
2   500  1157  1159  MED  MED
3   520  1157  1159  MED  MED, 550:    Time   ID1   ID2 Att1 Att2
4   560  1159  1191  MED  MED}

print (dfs[0])
   Time   ID1   ID2 Att1 Att2
0   140  1157  1232  MED  ADM
1   160  1157  1191  MED  MED

print (dfs[200])
   Time   ID1   ID2 Att1 Att2
2   500  1157  1159  MED  MED
3   520  1157  1159  MED  MED

Also is possible create default indices by dict comprehension:

dfs = {k:v.reset_index(drop=True) for k, v in df.groupby(cat)}
print (dfs[0])
   Time   ID1   ID2 Att1 Att2
0   140  1157  1232  MED  ADM
1   160  1157  1191  MED  MED

print (dfs[200])
   Time   ID1   ID2 Att1 Att2
0   500  1157  1159  MED  MED
1   520  1157  1159  MED  MED

If want solution without cut/groupby you can zip all values of list without last with all values without first and use dict comprehension:

zipped = zip(my_list[:-1], my_list[1:])
print (list(zipped))
[(0, 200), (200, 550), (550, 600)]

sorted_df = {i : df.loc[(df['Time'] > i) & (df['Time'] <= j)].reset_index(drop=True) 
                 for i, j in zipped}
print (sorted_df)
{0:    Time   ID1   ID2 Att1 Att2
0   140  1157  1232  MED  ADM
1   160  1157  1191  MED  MED, 200:    Time   ID1   ID2 Att1 Att2
0   500  1157  1159  MED  MED
1   520  1157  1159  MED  MED, 550:    Time   ID1   ID2 Att1 Att2
0   560  1159  1191  MED  MED}

Upvotes: 1

Martin
Martin

Reputation: 231

When you create the data frame via read_csv you should be able to directly access the rows via their time value:

df = pd.read_csv('hospital.dat_', delimiter='\t', index_col=['Time'])
d1 = df[140:199]
d2 = df[200:549]

I am not sure on the delimiter required here, though.

Upvotes: 0

Related Questions