Feva

Reputation: 37

How to split a dataframe and select all possible pairs?

I have a dataframe that I want to separate in order to apply a certain function.

I have the fields df['beam'], df['track'], df['cycle'] and want to separate it by unique values of each of these three. Then, I want to apply this function (it works on two individual dataframes) to each pair of sub-dataframes where df['track'] differs between the two. Also, the result doesn't change if you swap the order of the pair, so I'd like to avoid unnecessary calls to the function if possible.

I currently do this with four nested for loops feeding an if conditional, but I'm absolutely sure there's a better, cleaner way.

I'd appreciate all help!

Edit: I ended up solving it like this:

  1. I split the original dataframe into multiple by using df.groupby()

    dfsplit=dict(tuple(df.groupby(['beam','track','cycle'])))

    df.groupby() itself returns a DataFrameGroupBy object, so wrapping it in dict(tuple(...)) gives a dictionary where the keys are all the unique ['beam','track','cycle'] combinations as tuples and the values are the corresponding sub-dataframes

  2. I combined all possible ['beam','track','cycle'] pairs with the use of itertools.combinations()

    keys=list(itertools.combinations(dfsplit.keys(),2))

This generates a list of 2-element tuples where each element is itself a ['beam','track','cycle'] tuple. It doesn't include any pair with the order swapped, so I avoid calling the function twice for what would be the same case.

  3. I removed the combinations where 'track' was the same through a for loop

    for k in keys.copy():
        if k[0][1]==k[1][1]:
            keys.remove(k)

Now I can call my function by looping through the list of combinations

for k in keys:
    function(dfsplit[k[0]],dfsplit[k[1]])

Step 3 is taking a long time, probably because I have a very large number of unique ['beam','track','cycle'] combinations so the list is very long, but also probably because I'm doing it sub-optimally. I'll keep the question open in case someone realizes a better way to do this last step.

EDIT 2: Solved the problem with step 3, once again with itertools, just by doing

keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))

itertools.filterfalse returns the elements of the iterable for which the predicate returns false, so it does the same as the previous for loop but selects the false cases instead of removing the true ones. It's very fast and I believe this solves my problem for good.
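A quick self-contained check, using made-up ('beam','track','cycle') key pairs, that the filterfalse one-liner keeps exactly the pairs the earlier remove-loop would have kept:

```python
import itertools

# Made-up (beam, track, cycle) key pairs, as itertools.combinations would produce
pairs = [
    ((1, "a", 0), (1, "b", 0)),  # tracks differ -> keep
    ((1, "a", 0), (2, "a", 0)),  # tracks equal  -> drop
    ((1, "b", 0), (2, "a", 0)),  # tracks differ -> keep
]

# filterfalse keeps the elements for which the predicate is False,
# i.e. the pairs whose 'track' components (index 1) are NOT equal
kept = list(itertools.filterfalse(lambda k: k[0][1] == k[1][1], pairs))

# The remove-on-a-copy loop from step 3 gives the same result
loop_kept = pairs.copy()
for k in pairs:
    if k[0][1] == k[1][1]:
        loop_kept.remove(k)

assert kept == loop_kept
```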

Upvotes: 1

Views: 97

Answers (1)

Feva

Reputation: 37

I don't know how to mark the question as solved so I'll just repeat the solution here:

import itertools

dfsplit=dict(tuple(df.groupby(['beam','track','cycle'])))
keys=list(itertools.combinations(dfsplit.keys(),2))
keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))
for k in keys:
    function(dfsplit[k[0]],dfsplit[k[1]])
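For anyone who wants to try it end to end, here is a minimal runnable sketch with a toy dataframe and a placeholder `function` (which just records the sizes of the two sub-dataframes it receives; the real pairwise function would go in its place):

```python
import itertools
import pandas as pd

# Toy dataframe standing in for the real data (made-up values)
df = pd.DataFrame({
    "beam":  [1, 1, 1, 2],
    "track": ["a", "a", "b", "a"],
    "cycle": [0, 0, 0, 0],
    "x":     [1.0, 2.0, 3.0, 4.0],
})

# Placeholder for the real pairwise function
results = []
def function(df1, df2):
    results.append((len(df1), len(df2)))

# Split into one sub-dataframe per unique (beam, track, cycle)
dfsplit = dict(tuple(df.groupby(["beam", "track", "cycle"])))

# All unordered key pairs, then drop those whose 'track' values match
keys = list(itertools.combinations(dfsplit.keys(), 2))
keys = list(itertools.filterfalse(lambda k: k[0][1] == k[1][1], keys))

for k in keys:
    function(dfsplit[k[0]], dfsplit[k[1]])
```

With three groups here, (1,'a',0), (1,'b',0) and (2,'a',0), only the two pairs with differing tracks survive the filter, so `function` is called twice.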

Upvotes: 1
