Reputation: 13
Hi Stack Overflow community!
I have the following set of data:
0 A 0.000027769231 1 B 0.000030287440 0.628306 0.988151 1
0 A 0.000027479497 2 C 0.000035937793 0.581428 0.976041 1
1 B 0.000030287440 2 C 0.000035532483 0.516033 0.987388 1
4 D 0.000011085990 5 E 0.000008163211 0.577556 0.943583 1
4 D 0.000010787916 8 F 0.000008873166 0.531686 0.954017 1
5 E 0.000007865264 8 F 0.000008873166 0.691516 0.989945 1
311 G 0.000006216949 312 H 0.000002510852 0.829361 0.983148 1
326 M 0.000028129783 327 N 0.000011022112 0.843188 0.915627 1
326 M 0.000027462953 328 O 0.000002167529 1.742349 0.943267 1
326 M 0.000028024026 329 P 0.000005130416 1.263187 0.924010 1
326 M 0.000027630314 330 R 0.000002965539 1.668906 0.935518 1
326 M 0.000027721668 331 S 0.000002614498 1.851544 0.939051 1
326 M 0.000028129332 332 T 0.000003145471 1.742525 0.930186 1
327 N 0.000011020065 328 O 0.000002570277 2.473902 0.943474 1
327 N 0.000011028065 329 P 0.000005235456 1.447848 0.976569 1
327 N 0.000011032158 330 R 0.000003154471 2.303768 0.955479 1
327 N 0.000011025788 331 S 0.000002864823 2.038783 0.946972 1
327 N 0.000011064135 332 T 0.000003183160 1.213611 0.975056 1
328 O 0.000002505234 329 P 0.000005129224 1.549313 0.968629 1
328 O 0.000002452331 330 R 0.000002965465 2.328536 0.981076 1
329 P 0.000005147180 330 R 0.000003095314 2.803627 0.977268 1
329 P 0.000005208069 332 T 0.000003147536 2.658807 0.984912 1
330 R 0.000002967887 331 S 0.000002700052 1.208673 0.987825 1
330 R 0.000003110114 332 T 0.000003145140 2.428988 0.983747 1
331 S 0.000002853757 332 T 0.000003145464 1.551457 0.982276 1
366 I 0.000000326315 367 J 0.000000253986 1.410176 0.961879 1
366 I 0.000000327483 368 K 0.000000110327 1.236265 0.918510 1
366 I 0.000000326939 369 Q 0.000000165208 2.258098 0.907039 1
367 J 0.000000257330 368 K 0.000000113511 2.600934 0.907874 1
367 J 0.000000256872 369 Q 0.000000166861 1.102368 0.937099 1
Each row contains a unique pair of elements, which I have labelled here with letters. I want to group these elements and pick the largest value from column 3 or column 6 within each group. For this dataset I should get 5 groups, each with its elements and the maximum value from column 3 or 6:
A
B
C
maxval: C: 0.000035937793
D
E
F
maxval: D: 0.000011085990
G
H
maxval: G: 0.000006216949
M
N
O
P
R
S
T
maxval: M: 0.000028129783
I
J
K
Q
maxval: I: 0.000000326939
As you can see, when the same element (e.g. A) appears in more than one row, its values in column 3 differ slightly. However, we can assume that A has the same column 3 value in every case.
As an output I want to get three files:
2 C
4 D
311 G
326 M
366 I
0 A
1 B
5 E
8 F
312 H
327 N
328 O
329 P
330 R
331 S
332 T
367 J
368 K
369 Q
I have no idea how to do this in Python. Can anyone help me with some advice or code snippets?
Upvotes: 0
Views: 83
Reputation: 1497
I am not sure whether this answers exactly what you want, since some parts are unclear to me, but small adjustments inside the loop should be easy to make.
With the help of pandas and numpy,
import pandas as pd
import numpy as np
We can load the data
data = pd.read_csv("data.txt", sep=" ", header=None)
And define a function
# https://stackoverflow.com/questions/39915402/combine-a-list-of-pairs-tuples
def make_equiv_classes(pairs):
    groups = {}
    for (x, y) in pairs:
        xset = groups.get(x, set([x]))
        yset = groups.get(y, set([y]))
        jset = xset | yset
        for z in jset:
            groups[z] = jset
    return set(map(tuple, groups.values()))
And create our classes
classes = make_equiv_classes( data.values[:,[1,4]] )
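To make the grouping step easier to follow, here is a minimal self-contained sketch on a few pairs from the question. It uses a small union-find rather than the set-merging function above, so it is an alternative illustration and not the code this answer relies on; the name group_pairs is made up for the example.

```python
def group_pairs(pairs):
    """Group elements that are linked by pairs into connected components."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if it is new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    # Sorted output so the result is deterministic.
    return sorted(sorted(g) for g in groups.values())


pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("D", "F"), ("G", "H")]
print(group_pairs(pairs))
# [['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H']]
```

Either approach should give the same groups; the union-find version just avoids rebuilding sets on every pair.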
Then for each class
for cls in classes:
    print(sorted(cls))
    sub_class = data.loc[data[1].isin(cls) | data[4].isin(cls)]
    max_class_value = np.max( sub_class.values[:,[2,5]] )
    subclass_argmax = np.argmax( np.max( sub_class.values[:,[2,5]], axis=1) )
    data_argmax = sub_class.iloc[subclass_argmax][0]
    first_letter = sub_class.iloc[subclass_argmax][1]
    second_letter = sub_class.iloc[subclass_argmax][4]
    print( "Max Class Value: {}".format(max_class_value))
    print( "Max Class Number: {}".format(data_argmax))
    print( "First letter: {}, Second Letter: {}".format(first_letter, second_letter))
    print( "\n")
It will print (the order of the groups may differ between runs, because the classes are stored in a set):
['M', 'N', 'O', 'P', 'R', 'S', 'T']
Max Class Value: 2.8129783000000003e-05
Max Class Number: 326
First letter: M, Second Letter: N
['G', 'H']
Max Class Value: 6.216949e-06
Max Class Number: 311
First letter: G, Second Letter: H
['D', 'E', 'F']
Max Class Value: 1.108599e-05
Max Class Number: 4
First letter: D, Second Letter: E
['I', 'J', 'K', 'Q']
Max Class Value: 3.27483e-07
Max Class Number: 366
First letter: I, Second Letter: K
['A', 'B', 'C']
Max Class Value: 3.5937793e-05
Max Class Number: 0
First letter: A, Second Letter: C
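Note that the loop above reports the number from the first column of the winning row, so for the A, B, C group it prints 0 (A) even though the largest value belongs to C. If you want the element that actually holds the maximum, plus the split into "maximum elements" and "all the others" shown in the question, something along these lines might work. This is only a sketch in plain Python on a few rows from the question: the function name split_by_group_max is made up, the groups are written out by hand, and since the question does not make clear what the third file should contain, only two lists are produced.

```python
# Rows: (id1, name1, value1, id2, name2, value2), copied from the question.
rows = [
    (0, "A", 0.000027769231, 1, "B", 0.000030287440),
    (0, "A", 0.000027479497, 2, "C", 0.000035937793),
    (1, "B", 0.000030287440, 2, "C", 0.000035532483),
    (4, "D", 0.000011085990, 5, "E", 0.000008163211),
    (4, "D", 0.000010787916, 8, "F", 0.000008873166),
    (5, "E", 0.000007865264, 8, "F", 0.000008873166),
    (311, "G", 0.000006216949, 312, "H", 0.000002510852),
]
# Assumed to come from the grouping step; hard-coded here for illustration.
groups = [{"A", "B", "C"}, {"D", "E", "F"}, {"G", "H"}]


def split_by_group_max(rows, groups):
    # Largest value seen for each element, taken from either side of a row.
    best = {}
    for id1, n1, v1, id2, n2, v2 in rows:
        for ident, name, value in ((id1, n1, v1), (id2, n2, v2)):
            if name not in best or value > best[name][1]:
                best[name] = (ident, value)

    maxima, others = [], []
    for group in groups:
        # Element of this group with the largest value.
        top = max(group, key=lambda name: best[name][1])
        for name in sorted(group):
            target = maxima if name == top else others
            target.append((best[name][0], name))
    return sorted(maxima), sorted(others)


maxima, others = split_by_group_max(rows, groups)
print(maxima)  # [(2, 'C'), (4, 'D'), (311, 'G')]
print(others)  # [(0, 'A'), (1, 'B'), (5, 'E'), (8, 'F'), (312, 'H')]
```

Each list can then be written to its own file, one "number letter" pair per line, for example with "\n".join("{} {}".format(i, n) for i, n in maxima).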
Upvotes: 1