Reputation: 31
I'm looking for an elegant approach to the following problem.
I'm working from a DataFrame with 15 columns and 1250 rows of chemical compound information (one compound per row). One column, "molecular_mass", holds numbers that I would like to use as a handle to create groups of 100 compounds each, where no compound's "molecular_mass" is within +/- 1 of any other compound's "molecular_mass" in the same group.
I'm performing the following to get randomized groups of 100, but this doesn't help me with my problem of keeping the "molecular_mass" numbers +/- 1 apart from any other number in the group.
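import pandas as pd

df = pd.read_csv('data.csv')
# Shuffle the rows, then rebuild a clean 0..n-1 index
df = df.sample(frac=1).reset_index(drop=True)
SIZE = 100
# Rows 0-99 land in group 0, rows 100-199 in group 1, and so on
df['group'] = df.index // SIZE
groups = [
    df[df['group'] == num]
    for num in range(df['group'].max() + 1)
]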
Adding a few example lines from data.csv
Compound | molecular_mass | Plate | Column | Row | Solubility |
---|---|---|---|---|---|
AAA | 74.12 | 1 | 1 | A | 100/0 |
BBB | 74.12 | 3 | 4 | D | 100/0 |
CCC | 76.12 | 2 | 5 | H | 80/20 |
DDD | 120.3 | 6 | 10 | F | 50/50 |
EEE | 121.3 | 1 | 1 | B | 100/0 |
FFF | 119.3 | 1 | 1 | C | 100/0 |
GGG | 150.3 | 5 | 13 | D | 100/0 |
data.csv is in the format above (only the 6 most important columns are shown).
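To make the constraint concrete, here is a minimal sketch of a check for a single group. The helper name `has_mass_conflict` and its `min_gap` parameter are invented for illustration, and it assumes that a difference of exactly 1 already counts as "within +/- 1".

```python
def has_mass_conflict(masses, min_gap=1.0):
    """Return True if any two masses in the group differ by min_gap or less.

    Hypothetical helper: after sorting, the closest pair is always adjacent,
    so only neighbouring values need to be compared.
    """
    ordered = sorted(masses)
    return any(b - a <= min_gap for a, b in zip(ordered, ordered[1:]))


# Masses taken from the example rows above
print(has_mass_conflict([74.12, 76.12, 120.3, 150.3]))  # AAA, CCC, DDD, GGG
print(has_mass_conflict([74.12, 74.12, 76.12]))         # AAA, BBB, CCC
```

The same function should also accept a DataFrame column, for example `has_mass_conflict(groups[0]['molecular_mass'])`, which is how the random groups above could be tested.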
Upvotes: 3
Views: 161
Reputation: 31
(First post on SO, so let me know what I can do better. This is NOT a numpy answer, but I'm hoping it helps anyway.)
For my solution I'd create an empty dict to store the groups of compounds and loop over the data given in the table, checking each existing group against the two conditions you described (the group size limit and the mass gap).
If the current group passes both checks, the compound is appended to it; otherwise the loop moves on to the next group, or creates a new group in the dict if no existing group fits.
Since I wasn't sure I'd be able to do this in numpy, I've created a Data class to act as an iterable that returns both the compound name and the compound mass. I'm hoping you can convert the ideas here to your code easily.
With the code below I got the following result:
{0: ['A', 'C', 'D', 'G'], 1: ['B', 'E', 'F']}
class Data:
    # Sample data taken from the table in the question
    compound = ["A", "B", "C", "D", "E", "F", "G"]
    mass = [74.12, 74.12, 76.12, 120.3, 121.3, 119.3, 150.3]

    def __init__(self):
        assert len(self.compound) == len(self.mass)
        self._current_index = 0

    def __len__(self):
        return len(self.compound)

    def __getitem__(self, item):
        # Returns a (name, mass) tuple; also makes the object iterable
        member = self.compound[item], self.mass[item]
        return member


MAX_COMPOUNDS_PER_GROUP = 100
MAX_WEIGHT_DELTA_IN_GROUP = 1
data = Data()
groups = dict()
for dd in data:
    print(dd)
    if len(groups) == 0:
        print(f"group 0 created. added {dd[0]} to it")
        groups[0] = [dd[0]]
        continue
    for group, comps_in_group in groups.items():
        # Condition 1: skip groups that are already full
        if len(comps_in_group) >= MAX_COMPOUNDS_PER_GROUP:
            continue
        # Condition 2: look for a mass conflict with any member
        for comp in comps_in_group:
            # Lookup by name, so compound names are assumed to be unique
            comp_idx = data.compound.index(comp)
            comp_mas = data.mass[comp_idx]
            if abs(dd[1] - comp_mas) <= MAX_WEIGHT_DELTA_IN_GROUP:
                break
        else:
            # Inner loop finished without a break: no conflict found
            print(f"no mass conflict. appending {dd[0]} to group {group}")
            comps_in_group.append(dd[0])
            break
    else:
        # Outer loop finished without a break: no group could take it
        print(
            f"no group available. creating group {len(groups)} and adding "
            f"{dd[0]}"
        )
        groups[len(groups)] = [dd[0]]
print(groups)
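The same first-fit idea can also be wrapped in a function that keeps each group's masses next to its names, so no lookup by compound name is needed. This is only a sketch: `assign_groups` and its parameters are names invented here, and like the loop above it is greedy, so it does not guarantee the fewest groups or that every group fills up to 100.

```python
def assign_groups(records, max_size=100, min_gap=1.0):
    """Greedy first-fit grouping of (name, mass) pairs.

    Hypothetical helper, not a library function. A compound joins the first
    group that has room and holds no mass within min_gap of its own mass;
    otherwise a new group is opened.
    """
    groups = []  # each entry: {"names": [...], "masses": [...]}
    for name, mass in records:
        for group in groups:
            if len(group["names"]) >= max_size:
                continue  # group is full
            if all(abs(mass - other) > min_gap for other in group["masses"]):
                group["names"].append(name)
                group["masses"].append(mass)
                break
        else:
            # No existing group could take this compound
            groups.append({"names": [name], "masses": [mass]})
    return [group["names"] for group in groups]


# Example rows from the question
records = [
    ("AAA", 74.12), ("BBB", 74.12), ("CCC", 76.12), ("DDD", 120.3),
    ("EEE", 121.3), ("FFF", 119.3), ("GGG", 150.3),
]
print(assign_groups(records))
```

With the DataFrame from the question, the input could be built with `zip(df['Compound'], df['molecular_mass'])`, assuming those column names match the CSV.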
Upvotes: 1