Matteo Mcdonnell
Matteo Mcdonnell

Reputation: 49

Cartesian product of all items for a given group using pandas

I am starting with a DataFrame that looks like this:

       id        tof
0    43.0  1999991.0
1    43.0  2095230.0
2    43.0  4123105.0
3    43.0  5560423.0
4    46.0  2098996.0
5    46.0  2114971.0
6    46.0  4130033.0
7    46.0  4355096.0
8    82.0  2055207.0
9    82.0  2093996.0
10   82.0  4193587.0
11   90.0  2059360.0
12   90.0  2083762.0
13   90.0  2648235.0
14   90.0  4212177.0
15  103.0  1993306.0
          .
          .
          .

and ultimately my goal is to create a very long two dimensional array that contains all combinations of items with the same id like this (for rows with id 43):

[(1993306.0, 2105441.0), (1993306.0, 3972679.0), (1993306.0, 3992558.0), (1993306.0, 4009044.0), (2105441.0, 3972679.0), (2105441.0, 3992558.0), (2105441.0, 4009044.0), (3972679.0, 3992558.0), (3972679.0, 4009044.0), (3992558.0, 4009044.0),...]

except changing all the tuples to arrays so that I could transpose the array after iterating over all id numbers.

Naturally, itertools came to mind, and my first thought was doing something with df.groupby('id') so that it would apply itertools internally to every group with the same id, but I would guess that this would take absolutely forever with the million line datafiles I have.

Is there a vectorized way to do this?

Upvotes: 0

Views: 545

Answers (3)

piRSquared
piRSquared

Reputation: 294488

IIUC:

from itertools import combinations

pd.DataFrame([
    [k, c0, c1] for k, tof in df.groupby('id').tof
           for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])

      id       tof0       tof1
0   43.0  1999991.0  2095230.0
1   43.0  1999991.0  4123105.0
2   43.0  1999991.0  5560423.0
3   43.0  2095230.0  4123105.0
4   43.0  2095230.0  5560423.0
5   43.0  4123105.0  5560423.0
6   46.0  2098996.0  2114971.0
7   46.0  2098996.0  4130033.0
8   46.0  2098996.0  4355096.0
9   46.0  2114971.0  4130033.0
10  46.0  2114971.0  4355096.0
11  46.0  4130033.0  4355096.0
12  82.0  2055207.0  2093996.0
13  82.0  2055207.0  4193587.0
14  82.0  2093996.0  4193587.0
15  90.0  2059360.0  2083762.0
16  90.0  2059360.0  2648235.0
17  90.0  2059360.0  4212177.0
18  90.0  2083762.0  2648235.0
19  90.0  2083762.0  4212177.0
20  90.0  2648235.0  4212177.0

Explanation

This is a list comprehension that returns a list of lists wrapped up by a dataframe constructor. Look up comprehensions to understand better.

from itertools import combinations

pd.DataFrame([
    #            name   series of tof values
    #               ↓   ↓    
    [k, c0, c1] for k, tof in df.groupby('id').tof
    #    items from combinations
    #      first    second
    #          ↓    ↓
           for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150785

Groupby does work:

def get_product(x):
    return pd.MultiIndex.from_product((x.tof, x.tof)).values

for i, g in df.groupby('id'):
    print(i, get_product(g))

Output:

43.0 [(1999991.0, 1999991.0) (1999991.0, 2095230.0) (1999991.0, 4123105.0)
 (1999991.0, 5560423.0) (2095230.0, 1999991.0) (2095230.0, 2095230.0)
 (2095230.0, 4123105.0) (2095230.0, 5560423.0) (4123105.0, 1999991.0)
 (4123105.0, 2095230.0) (4123105.0, 4123105.0) (4123105.0, 5560423.0)
 (5560423.0, 1999991.0) (5560423.0, 2095230.0) (5560423.0, 4123105.0)
 (5560423.0, 5560423.0)]
46.0 [(2098996.0, 2098996.0) (2098996.0, 2114971.0) (2098996.0, 4130033.0)
 (2098996.0, 4355096.0) (2114971.0, 2098996.0) (2114971.0, 2114971.0)
 (2114971.0, 4130033.0) (2114971.0, 4355096.0) (4130033.0, 2098996.0)
 (4130033.0, 2114971.0) (4130033.0, 4130033.0) (4130033.0, 4355096.0)
 (4355096.0, 2098996.0) (4355096.0, 2114971.0) (4355096.0, 4130033.0)
 (4355096.0, 4355096.0)]
82.0 [(2055207.0, 2055207.0) (2055207.0, 2093996.0) (2055207.0, 4193587.0)
 (2093996.0, 2055207.0) (2093996.0, 2093996.0) (2093996.0, 4193587.0)
 (4193587.0, 2055207.0) (4193587.0, 2093996.0) (4193587.0, 4193587.0)]
90.0 [(2059360.0, 2059360.0) (2059360.0, 2083762.0) (2059360.0, 2648235.0)
 (2059360.0, 4212177.0) (2083762.0, 2059360.0) (2083762.0, 2083762.0)
 (2083762.0, 2648235.0) (2083762.0, 4212177.0) (2648235.0, 2059360.0)
 (2648235.0, 2083762.0) (2648235.0, 2648235.0) (2648235.0, 4212177.0)
 (4212177.0, 2059360.0) (4212177.0, 2083762.0) (4212177.0, 2648235.0)
 (4212177.0, 4212177.0)]
103.0 [(1993306.0, 1993306.0)]

Upvotes: 1

Ayoub ZAROU
Ayoub ZAROU

Reputation: 2417

from itertools import product
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(product(x,x))

if you'd prefer that elements don't repeat, you can use

from itertools import combinations
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(combinations(x,2))

Upvotes: 1

Related Questions