Reputation: 3099
How can I take a Pandas DataFrame, split it according to a condition, and save the resulting DataFrames as values in a dict object with pre-specified keys?
I have a Pandas dataframe (the hospital.dat_ data from R's networkDynamicData package). I read it in and process it as a Pandas DataFrame like this:
with open('hospital.dat_', 'r') as f:
data = f.readlines()
#Convert to a pandas dataframe for easier manipulation
split_data = []
for line in data:
split_line = line.split('\t')
split_data.append(split_line)
df = pd.DataFrame(split_data)
df.columns=['Time', 'ID1', 'ID2', 'Att1', 'Att2'] #assign column names
The DataFrame df
looks like this:
print df.head()
Time ID1 ID2 Att1 Att2
0 140 1157 1232 MED ADM\n
1 160 1157 1191 MED MED\n
2 500 1157 1159 MED MED\n
3 520 1157 1159 MED MED\n
4 560 1159 1191 MED MED\n
I would like to split the DataFrame
into sections based on Time
. As an MWE, suppose I want a dict
object with three DataFrame
values:
Time
less than 200; Time
greater than 200 but less than 550;Time
greater than 550 but less than 600 (all that's left)That is, I want:
0:
Time ID1 ID2 Att1 Att2
0 140 1157 1232 MED ADM\n
1 160 1157 1191 MED MED\n
200:
Time ID1 ID2 Att1 Att2
0 500 1157 1159 MED MED\n
1 520 1157 1159 MED MED\n
550:
Time ID1 ID2 Att1 Att2
0 560 1159 1191 MED MED\n
I have a list that contains the split values, with 0, and a max value at the end:
my_list = [0, 200, 550, 600]
As a start, I've tried:
sorted_df = {i : [df.loc[(df['Time'] > i) & (df['Time'] <= j)] for i, j in enumerate(my_list)]}
but that gives me only one key, because I keep overwriting, and the values are empty lists.
Question: how can I take a Pandas DataFrame, split it according to a condition, and save the resulting DataFrames as values in a dict object with pre-specified keys?
Upvotes: 2
Views: 715
Reputation: 862671
You can use cut
for bins and then convert groupby
object to dict
:
df = pd.read_table('hospital.dat_')
cat= pd.cut(df['Time'], bins=my_list, labels = my_list[:-1], include_lowest=True)
print (cat)
0 0
1 0
2 200
3 200
4 550
Name: Time, dtype: category
Categories (3, int64): [0 < 200 < 550]
dfs = dict(tuple(df.groupby(cat)))
print (dfs)
{0: Time ID1 ID2 Att1 Att2
0 140 1157 1232 MED ADM
1 160 1157 1191 MED MED, 200: Time ID1 ID2 Att1 Att2
2 500 1157 1159 MED MED
3 520 1157 1159 MED MED, 550: Time ID1 ID2 Att1 Att2
4 560 1159 1191 MED MED}
print (dfs[0])
Time ID1 ID2 Att1 Att2
0 140 1157 1232 MED ADM
1 160 1157 1191 MED MED
print (dfs[200])
Time ID1 ID2 Att1 Att2
2 500 1157 1159 MED MED
3 520 1157 1159 MED MED
Also is possible create default indices by dict comprehension:
dfs = {k:v.reset_index(drop=True) for k, v in df.groupby(cat)}
print (dfs[0])
Time ID1 ID2 Att1 Att2
0 140 1157 1232 MED ADM
1 160 1157 1191 MED MED
print (dfs[200])
Time ID1 ID2 Att1 Att2
0 500 1157 1159 MED MED
1 520 1157 1159 MED MED
If want solution without cut/groupby
you can zip
all values of list without last with all values without first and use dict comprehension
:
zipped = zip(my_list[:-1], my_list[1:])
print (list(zipped))
[(0, 200), (200, 550), (550, 600)]
sorted_df = {i : df.loc[(df['Time'] > i) & (df['Time'] <= j)].reset_index(drop=True)
for i, j in zipped}
print (sorted_df)
{0: Time ID1 ID2 Att1 Att2
0 140 1157 1232 MED ADM
1 160 1157 1191 MED MED, 200: Time ID1 ID2 Att1 Att2
0 500 1157 1159 MED MED
1 520 1157 1159 MED MED, 550: Time ID1 ID2 Att1 Att2
0 560 1159 1191 MED MED}
Upvotes: 1
Reputation: 231
When you create the data frame via read_csv
you should be able to directly access the rows via their time value:
df = pd.read_csv('hospital.dat_', delimiter='\t', index_col=['Time'])
d1 = df[140:199]
d2 = df[200:549]
I am not sure on the delimiter required here, though.
Upvotes: 0