VlS
VlS

Reputation: 586

Splitting data in Pandas/Python

I'm new to Python and Pandas so bear with me.

I have a big data that looks like:

1       E                    1                  NaN
2       T  2004-09-21 01:15:53                  NaN
3       U                   30                  NaN
4       N                   32                  NaN
5       V                    1  2004-09-14 16:26:00
6       V                   -1  2004-09-14 16:53:00
7       V                    1  2004-09-14 17:08:00
...................................................
18      E                    1                  Nan
19      T  2004-10-21 02:13:43                  Nan
20      U                   35                  Nan
21      N                   40                  Nan
22      V                    1  2004-10-19 14:50:00
23      V                    1  2004-10-20 15:31:00
24      V                    1  2004-10-21 13:49:00
25      V                    1  2004-10-21 20:57:00
26      V                    1  2004-10-21 22:11:00
...................................................

How can I split this into individual little data sets, lets say x(i) , where i=0,...,N, and for example x(0) looks like:

 1       E                    1                  NaN
 2       T  2004-09-21 01:15:53                  NaN
 3       U                   30                  NaN
 4       N                   32                  NaN
 5       V                    1  2004-09-14 16:26:00
 6       V                   -1  2004-09-14 16:53:00
 7       V                    1  2004-09-14 17:08:00
 ...................................................
 17      V                    1  2004-09-16 12:38:01

I guess I should use some loop command for going from E to E, but I'm not quite sure how to divide it into individual sets.

Upvotes: 1

Views: 221

Answers (2)

DSM
DSM

Reputation: 353199

You can use groupby here, using the compare-cumsum-groupby pattern (here let's say that the column with the Es is called "letter"):

>>> grouped = df.groupby((df["letter"] == "E").cumsum())
>>> frames = [g for k,g in grouped]
>>> for frame in frames:
...     print(frame)
...     print("--")
...     
  letter
0      E
1      T
2      U
--
  letter
3      E
4      M
--
  letter
5      E
--
  letter
6      E
--

This works because we can compare everything to E, creating a Series of booleans:

>>> df["letter"] == "E"
0     True
1    False
2    False
3     True
4    False
5     True
6     True
Name: letter, dtype: bool

and then if we take the cumulative sum of that we get

>>> (df["letter"] == "E").cumsum()
0    1
1    1
2    1
3    2
4    2
5    3
6    4
Name: letter, dtype: int32

where each new group has its own number. Reading the split-apply-combine section of the documentation is probably a good idea-- you might not even need to break everything up into subframes if the operation you want to perform on the groups is already supported.

Upvotes: 1

hellpanderr
hellpanderr

Reputation: 5896

You can first find indexes of rows with that value, then you can iterate through them and get slices of the dataframe:

def break_df_by(df, column, value):  
    ret = []
    indexes = df[df.ix[:, column] == value].index #[2, 5, 10] in the example below
    for n,i in enumerate(indexes):
        if n == 0: #if we want values before first 'E'
            ret.append(df[:i])
        elif n == len(indexes) - 1: #if we want values after last 'E'
            ret.append(df[i:])
        elif n > 0:
            ret.append(df[indexes[n-1]:i])

    return ret

An example, dataframe is:

    1
0   T
1   V
2   E
3   T
4   V
5   E
6   T
7   U
8   N
9   V
10  E
11  T
12  U
13  N

Result:

break_df_by(df, 1, 'E')

[   1
 0  T
 1  V,    1
 2  E
 3  T
 4  V,    1
 5  E
 6  T
 7  U
 8  N
 9  V,     1
 10  E
 11  T
 12  U
 13  N]

Upvotes: 0

Related Questions