Reputation: 586
I'm new to Python and Pandas, so bear with me.
I have a large dataset that looks like this:
1 E 1 NaN
2 T 2004-09-21 01:15:53 NaN
3 U 30 NaN
4 N 32 NaN
5 V 1 2004-09-14 16:26:00
6 V -1 2004-09-14 16:53:00
7 V 1 2004-09-14 17:08:00
...................................................
18 E 1 NaN
19 T 2004-10-21 02:13:43 NaN
20 U 35 NaN
21 N 40 NaN
22 V 1 2004-10-19 14:50:00
23 V 1 2004-10-20 15:31:00
24 V 1 2004-10-21 13:49:00
25 V 1 2004-10-21 20:57:00
26 V 1 2004-10-21 22:11:00
...................................................
How can I split this into individual little data sets, let's say x(i), where i = 0, ..., N? For example, x(0) would look like:
1 E 1 NaN
2 T 2004-09-21 01:15:53 NaN
3 U 30 NaN
4 N 32 NaN
5 V 1 2004-09-14 16:26:00
6 V -1 2004-09-14 16:53:00
7 V 1 2004-09-14 17:08:00
...................................................
17 V 1 2004-09-16 12:38:01
I guess I should use some kind of loop going from one E to the next, but I'm not quite sure how to divide the data into individual sets.
Upvotes: 1
Views: 221
Reputation: 353199
You can use groupby here, using the compare-cumsum-groupby pattern (let's say the column containing the Es is called "letter"):
>>> grouped = df.groupby((df["letter"] == "E").cumsum())
>>> frames = [g for k,g in grouped]
>>> for frame in frames:
... print(frame)
... print("--")
...
letter
0 E
1 T
2 U
--
letter
3 E
4 M
--
letter
5 E
--
letter
6 E
--
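Applied to the question's own data (which starts with an E row), the frames list built above lines up directly with the x(i) notation from the question:
>>> x0 = frames[0]   # the first E-to-E block, i.e. x(0)
>>> x1 = frames[1]   # the next block, x(1), and so on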
This works because we can compare everything to E, creating a Series of booleans:
>>> df["letter"] == "E"
0 True
1 False
2 False
3 True
4 False
5 True
6 True
Name: letter, dtype: bool
and then if we take the cumulative sum of that we get
>>> (df["letter"] == "E").cumsum()
0 1
1 1
2 1
3 2
4 2
5 3
6 4
Name: letter, dtype: int32
where each new group has its own number. Reading the split-apply-combine section of the documentation is probably a good idea; you might not even need to break everything up into sub-frames if the operation you want to perform on the groups is already supported.
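For example, if all you wanted was the size of each block, you could ask the grouped object from above directly instead of materialising the sub-frames (a small sketch using the same toy frame):
>>> grouped.size()
letter
1    3
2    2
3    1
4    1
dtype: int64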
Upvotes: 1
Reputation: 5896
You can first find the indexes of the rows with that value, then iterate through them and take slices of the dataframe:
def break_df_by(df, column, value):
    ret = []
    # index labels of the rows whose `column` equals `value`; [2, 5, 10] in the example below
    indexes = df[df.loc[:, column] == value].index
    for n, i in enumerate(indexes):
        if n == 0:
            ret.append(df[:i])                 # values before the first 'E' (may be empty)
        else:
            ret.append(df[indexes[n - 1]:i])   # values from the previous 'E' up to this one
        if n == len(indexes) - 1:
            ret.append(df[i:])                 # values from the last 'E' onwards
    return ret
An example dataframe:
1
0 T
1 V
2 E
3 T
4 V
5 E
6 T
7 U
8 N
9 V
10 E
11 T
12 U
13 N
Result:
break_df_by(df, 1, 'E')
[ 1
0 T
1 V, 1
2 E
3 T
4 V, 1
5 E
6 T
7 U
8 N
9 V, 1
10 E
11 T
12 U
13 N]
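Applied to the question's data, the call looks the same, assuming the column holding the E/T/U/N/V markers is named "letter" (that column name is an assumption, borrowed from the other answer):
frames = break_df_by(df, 'letter', 'E')   # "letter" is an assumed column name
# frames[0] holds whatever precedes the first 'E' (empty if the data starts with one);
# each later element is one E-to-E block.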
Upvotes: 0