martin

Reputation: 1185

Splitting data frame by irregular groups in Pandas

I have a data frame like the one in this example:

 Sample_name  Signature Len
    A           1         10
    A           2         10
    B           1         10
    B           2         10
    B           3         10
    C           1         10
    D           1         10
    D           2         10
    D           3         10
    D           4         10
    E           1         10
    E           2         10
    F           1         10
    F           2         10
    F           3         10
    F           4         10
    G           1         10

In this example DataFrame I have 7 different samples: A, B, C, D, E, F, G. I need to split it into several smaller DataFrames under one condition: each new DataFrame should contain all records for 2 samples.

So in this case the result should be 4 DataFrames: the first with all records for A and B, the second for C and D, the third for E and F, and the last with just G, because there are not enough samples left to make a pair.

Expected result:

new df1:

A           1         10
A           2         10
B           1         10
B           2         10
B           3         10

new df2:

 C           1         10
 D           1         10
 D           2         10
 D           3         10
 D           4         10

new df3:

E           1         10
E           2         10
F           1         10
F           2         10
F           3         10
F           4         10

new df4:

    G           1         10

As you can see, I can't just split df by row count, because each sample has a different number of rows. I tried doing it with a for loop, but it is really slow and throws errors (memory, key, shape). The DataFrame has 15 million records and 84k samples. I have read a lot of similar posts on SO, but none of them fit this problem.

Does anyone have a good idea how to do this?

Upvotes: 2

Views: 100

Answers (1)

jezrael

Reputation: 863291

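Use factorize to number the samples in order of first appearance, apply integer division by 2 so that every two consecutive samples share one group label, then group by those labels and convert the groupby object to a dictionary or a list: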

print (pd.factorize(df['Sample_name'])[0])
[0 0 1 1 1 2 3 3 3 3 4 4 5 5 5 5 6]

print (pd.factorize(df['Sample_name'])[0] // 2)
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 3]

#output is dict
dfs = dict(tuple(df.groupby(pd.factorize(df['Sample_name'])[0] // 2)))
#output is list
#dfs = [x for _, x in df.groupby(pd.factorize(df['Sample_name'])[0] // 2)]
print (dfs)
{0:   Sample_name  Signature  Len
0           A          1   10
1           A          2   10
2           B          1   10
3           B          2   10
4           B          3   10, 1:   Sample_name  Signature  Len
5           C          1   10
6           D          1   10
7           D          2   10
8           D          3   10
9           D          4   10, 2:    Sample_name  Signature  Len
10           E          1   10
11           E          2   10
12           F          1   10
13           F          2   10
14           F          3   10
15           F          4   10, 3:    Sample_name  Signature  Len
16           G          1   10}

print (dfs[0])
  Sample_name  Signature  Len
0           A          1   10
1           A          2   10
2           B          1   10
3           B          2   10
4           B          3   10
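
For a self-contained version that can be run as is, here is a minimal sketch of the same factorize plus integer division idea. The helper name split_by_sample and its per_chunk parameter are not part of pandas or of the answer above; they are assumptions introduced only for illustration, and the sample data is rebuilt by hand from the question.

```python
import pandas as pd


def split_by_sample(df, column="Sample_name", per_chunk=2):
    """Split df into a list of DataFrames holding per_chunk samples each.

    Hypothetical helper for illustration; not a pandas function.
    """
    # factorize numbers each distinct sample 0, 1, 2, ... in order of
    # first appearance; integer division maps every per_chunk consecutive
    # sample numbers onto one group label.
    codes = pd.factorize(df[column])[0]
    return [chunk for _, chunk in df.groupby(codes // per_chunk)]


# Sample data rebuilt from the question (17 rows, 7 samples).
df = pd.DataFrame({
    "Sample_name": list("AABBBCDDDDEEFFFFG"),
    "Signature": [1, 2, 1, 2, 3, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 1],
    "Len": 10,
})

chunks = split_by_sample(df)
for i, chunk in enumerate(chunks, start=1):
    print(f"new df{i}: samples {chunk['Sample_name'].unique().tolist()}")
```

Because the labels come from a single vectorized factorize call rather than a Python loop over samples, this should avoid the per-sample filtering that tends to be slow on millions of rows, although holding all chunks in a list still keeps a copy of the data in memory.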

Upvotes: 3
