Gordon

Reputation: 31

Python - grouping with conditions

I'm looking for an elegant approach to the following problem.

I'm working from a DataFrame with 15 columns and 1250 rows of chemical compound information (one row per compound). I would like to use the "molecular_mass" column as a handle to create groups of 100 compounds each, such that no two compounds in the same group have "molecular_mass" values within +/- 1 of each other.

I currently do the following to get randomized groups of 100, but it does nothing to keep the "molecular_mass" values more than +/- 1 apart within a group.

import pandas as pd

df = pd.read_csv('data.csv')
# Shuffle the rows, then cut the shuffled frame into consecutive chunks of SIZE.
df = df.sample(frac=1).reset_index(drop=True)
SIZE = 100
df['group'] = df.index // SIZE
groups = [
    df[df['group'] == num]
    for num in range(df['group'].max() + 1)
]

Adding a few example lines from data.csv

Compound  molecular_mass  Plate  Column  Row  Solubility
AAA       74.12           1      1       A    100/0
BBB       74.12           3      4       D    100/0
CCC       76.12           2      5       H    80/20
DDD       120.3           6      10      F    50/50
EEE       121.3           1      1       B    100/0
FFF       119.3           1      1       C    100/0
GGG       150.3           5      13      D    100/0

data.csv has the format shown above (only the 6 most important columns are shown).
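To make the constraint concrete, here is a minimal sketch of a check for a single group. The helper name `masses_are_separated` and the `min_gap` parameter are made up for illustration; the check treats a difference of exactly 1 as too close, which is my reading of "within +/- 1".

```python
from itertools import combinations


def masses_are_separated(masses, min_gap=1.0):
    """Return True if every pair of masses differs by more than min_gap.

    Hypothetical helper for illustration; not part of pandas.
    """
    return all(abs(a - b) > min_gap for a, b in combinations(masses, 2))


# Masses taken from the sample rows: AAA, CCC, DDD, GGG are all far apart.
print(masses_are_separated([74.12, 76.12, 120.3, 150.3]))  # True
# AAA and BBB share the same mass, so they cannot be in the same group.
print(masses_are_separated([74.12, 74.12, 150.3]))  # False
```

A randomly shuffled group of 100 will usually fail this check somewhere, which is the problem I'm trying to solve.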

Upvotes: 3

Views: 161

Answers (1)

M Nastri

Reputation: 31

(First post on SO, so let me know what I can do better. This is NOT a NumPy answer, but I hope it helps anyway.)

For my solution I'd create an empty dict to store the groups of compounds, then loop over the data in the table, checking each existing group against the two conditions you described (the group size limit and the mass separation).

If the current group passes both checks, the compound is appended to it; otherwise the loop moves on to the next group, or creates a new group in the dict if no existing group can take the compound.

Since I wasn't sure I could do this in NumPy, I created a Data class that acts as an iterable returning both the compound name and the compound mass. I hope you can adapt the ideas here to your code easily.

With the code below I got the following result:
{0: ['A', 'C', 'D', 'G'], 1: ['B', 'E', 'F']}

class Data:
    compound = ["A", "B", "C", "D", "E", "F", "G"]
    mass = [74.12, 74.12, 76.12, 120.3, 121.3, 119.3, 150.3]

    def __init__(self):
        assert len(self.compound) == len(self.mass)
        self._current_index = 0

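    def __len__(self):
        return len(self.compound)

    # Defining __getitem__ for integer indices is enough to make
    # `for dd in data` work: Python calls it with 0, 1, 2, ... until
    # an IndexError is raised by the underlying lists.
    def __getitem__(self, item):
        member = self.compound[item], self.mass[item]
        return member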


MAX_COMPOUNDS_PER_GROUP = 100
MAX_WEIGHT_DELTA_IN_GROUP = 1

data = Data()
groups = dict()
for dd in data:
    print(dd)
    if len(groups) == 0:
        print(f"group 0 created. added {dd[0]} to it")
        groups[0] = [dd[0]]
        continue
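    for group, comps_in_group in groups.items():
        if len(comps_in_group) >= MAX_COMPOUNDS_PER_GROUP:
            continue  # group is full, try the next one
        for comp in comps_in_group:
            comp_idx = data.compound.index(comp)
            comp_mas = data.mass[comp_idx]
            if abs(dd[1] - comp_mas) <= MAX_WEIGHT_DELTA_IN_GROUP:
                break  # mass conflict: this group cannot take the compound
        else:
            # inner loop ended without break: no conflict in this group
            print(f"no mass conflict. appending {dd[0]} to group {group}")
            comps_in_group.append(dd[0])
            break
    else:
        # outer loop ended without break: no existing group accepted it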
        print(
            f"no group available. creating group {len(groups)} and adding "
            f"{dd[0]}"
        )
        groups[len(groups)] = [dd[0]]
print(groups)
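If it helps with converting this to a DataFrame, the same first-fit idea can be sketched as a function that takes the masses and returns one group label per row. This is only a sketch: `assign_groups`, `max_size` and `min_gap` are names I made up, and it keeps each group's masses in a list so it doesn't need to look compounds up by name.

```python
def assign_groups(masses, max_size=100, min_gap=1.0):
    """Greedy first-fit: return one group label per mass, in input order.

    Hypothetical helper for illustration. A mass joins the first group that
    is not full and whose members all differ from it by more than min_gap.
    """
    group_masses = []  # group_masses[g] holds the masses already in group g
    labels = []
    for mass in masses:
        for g, members in enumerate(group_masses):
            if len(members) >= max_size:
                continue  # group is full, try the next one
            if all(abs(mass - m) > min_gap for m in members):
                members.append(mass)
                labels.append(g)
                break
        else:
            # no existing group could take this mass, so open a new one
            group_masses.append([mass])
            labels.append(len(group_masses) - 1)
    return labels


masses = [74.12, 74.12, 76.12, 120.3, 121.3, 119.3, 150.3]
print(assign_groups(masses))  # [0, 1, 0, 0, 1, 1, 0]
```

The labels match the dict result above (A, C, D, G in group 0 and B, E, F in group 1). With your data, something like `df['group'] = assign_groups(df['molecular_mass'].tolist())` should attach the labels to the rows. Note that this greedy approach doesn't guarantee groups of exactly 100: the later groups may end up smaller, and you may get more groups than strictly necessary.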

Upvotes: 1
