Reputation: 1185
I have a data frame like the one in this example:
Sample_name Signature Len
A 1 10
A 2 10
B 1 10
B 2 10
B 3 10
C 1 10
D 1 10
D 2 10
D 3 10
D 4 10
E 1 10
E 2 10
F 1 10
F 2 10
F 3 10
F 4 10
G 1 10
This example DF has 7 different samples: A, B, C, D, E, F, G. I need to split this data frame into several smaller ones based on one condition: each new data frame should contain all records for 2 samples.
So in this case the result should be 4 data frames: the first with all records for A and B, the second for C and D, the third for E and F, and the last with just G, because there are not enough samples left to form a pair.
Expected result:
new df1:
A 1 10
A 2 10
B 1 10
B 2 10
B 3 10
new df2:
C 1 10
D 1 10
D 2 10
D 3 10
D 4 10
new df3:
E 1 10
E 2 10
F 1 10
F 2 10
F 3 10
F 4 10
new df4:
G 1 10
As you can see, I can't just split the df by row count, because each sample has a different number of rows. I tried doing it with a for loop, but it is really slow and throws errors (memory, key, shape). The DF has 15 million records and 84k samples. I have read a lot of similar posts on SO, but none of them fit this problem.
Does anyone have a good idea how to do this?
Upvotes: 2
Views: 100
Reputation: 863291
Use factorize with integer division to build the group labels, then convert the groupby object to a dictionary or a list:
print (pd.factorize(df['Sample_name'])[0])
[0 0 1 1 1 2 3 3 3 3 4 4 5 5 5 5 6]
print (pd.factorize(df['Sample_name'])[0] // 2)
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 3]
# output is a dict keyed by group number
dfs = dict(tuple(df.groupby(pd.factorize(df['Sample_name'])[0] // 2)))
# alternative: output is a list
#dfs = [x for _, x in df.groupby(pd.factorize(df['Sample_name'])[0] // 2)]
print (dfs)
{0: Sample_name Signature Len
0 A 1 10
1 A 2 10
2 B 1 10
3 B 2 10
4 B 3 10, 1: Sample_name Signature Len
5 C 1 10
6 D 1 10
7 D 2 10
8 D 3 10
9 D 4 10, 2: Sample_name Signature Len
10 E 1 10
11 E 2 10
12 F 1 10
13 F 2 10
14 F 3 10
15 F 4 10, 3: Sample_name Signature Len
16 G 1 10}
print (dfs[0])
Sample_name Signature Len
0 A 1 10
1 A 2 10
2 B 1 10
3 B 2 10
4 B 3 10
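For reference, the approach above can be put together as a self-contained sketch. The helper name `split_by_samples` and its `samples_per_chunk` parameter are hypothetical, introduced only for illustration, and the frame is rebuilt from the question's example data. It assumes the whole frame fits in memory:

```python
import pandas as pd

# Rebuild the example frame from the question: rows per sample, Len always 10.
rows_per_sample = {"A": 2, "B": 3, "C": 1, "D": 4, "E": 2, "F": 4, "G": 1}
df = pd.DataFrame(
    [(name, sig, 10) for name, n in rows_per_sample.items() for sig in range(1, n + 1)],
    columns=["Sample_name", "Signature", "Len"],
)


def split_by_samples(frame, samples_per_chunk=2):
    """Hypothetical helper: list of frames, each with all rows of up to
    `samples_per_chunk` distinct samples."""
    # factorize numbers the samples 0, 1, 2, ... in order of first appearance.
    codes = pd.factorize(frame["Sample_name"])[0]
    # Integer division maps consecutive sample codes onto one chunk id.
    chunk_ids = codes // samples_per_chunk
    # groupby iterates over chunk ids in sorted order by default.
    return [chunk for _, chunk in frame.groupby(chunk_ids)]


chunks = split_by_samples(df)
chunk_samples = [list(chunk["Sample_name"].unique()) for chunk in chunks]
print(chunk_samples)
```

Because factorize assigns codes by first appearance, the rows of a sample should end up in the same chunk even if they are not contiguous in the frame. Changing `samples_per_chunk` gives larger groups in the same way.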
Upvotes: 3