Reputation: 49
I am starting with a DataFrame that looks like this:
id tof
0 43.0 1999991.0
1 43.0 2095230.0
2 43.0 4123105.0
3 43.0 5560423.0
4 46.0 2098996.0
5 46.0 2114971.0
6 46.0 4130033.0
7 46.0 4355096.0
8 82.0 2055207.0
9 82.0 2093996.0
10 82.0 4193587.0
11 90.0 2059360.0
12 90.0 2083762.0
13 90.0 2648235.0
14 90.0 4212177.0
15 103.0 1993306.0
.
.
.
and ultimately my goal is to create a very long two dimensional array that contains all combinations of items with the same id like this (for rows with id 43):
[(1993306.0, 2105441.0), (1993306.0, 3972679.0), (1993306.0, 3992558.0), (1993306.0, 4009044.0), (2105441.0, 3972679.0), (2105441.0, 3992558.0), (2105441.0, 4009044.0), (3972679.0, 3992558.0), (3972679.0, 4009044.0), (3992558.0, 4009044.0),...]
except changing all the tuples to arrays so that I could transpose the array after iterating over all id numbers.
Naturally, itertools came to mind, and my first thought was doing something with df.groupby('id')
so that it would apply itertools internally to every group with the same id, but I would guess that this would take absolutely forever with the million line datafiles I have.
Is there a vectorized way to do this?
Upvotes: 0
Views: 545
Reputation: 294488
IIUC:
from itertools import combinations
pd.DataFrame([
[k, c0, c1] for k, tof in df.groupby('id').tof
for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
id tof0 tof1
0 43.0 1999991.0 2095230.0
1 43.0 1999991.0 4123105.0
2 43.0 1999991.0 5560423.0
3 43.0 2095230.0 4123105.0
4 43.0 2095230.0 5560423.0
5 43.0 4123105.0 5560423.0
6 46.0 2098996.0 2114971.0
7 46.0 2098996.0 4130033.0
8 46.0 2098996.0 4355096.0
9 46.0 2114971.0 4130033.0
10 46.0 2114971.0 4355096.0
11 46.0 4130033.0 4355096.0
12 82.0 2055207.0 2093996.0
13 82.0 2055207.0 4193587.0
14 82.0 2093996.0 4193587.0
15 90.0 2059360.0 2083762.0
16 90.0 2059360.0 2648235.0
17 90.0 2059360.0 4212177.0
18 90.0 2083762.0 2648235.0
19 90.0 2083762.0 4212177.0
20 90.0 2648235.0 4212177.0
This is a list comprehension that returns a list of lists wrapped up by a dataframe constructor. Look up comprehensions to understand better.
from itertools import combinations
pd.DataFrame([
# name series of tof values
# ↓ ↓
[k, c0, c1] for k, tof in df.groupby('id').tof
# items from combinations
# first second
# ↓ ↓
for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
Upvotes: 2
Reputation: 150785
Groupby does work:
def get_product(x):
return pd.MultiIndex.from_product((x.tof, x.tof)).values
for i, g in df.groupby('id'):
print(i, get_product(g))
Output:
43.0 [(1999991.0, 1999991.0) (1999991.0, 2095230.0) (1999991.0, 4123105.0)
(1999991.0, 5560423.0) (2095230.0, 1999991.0) (2095230.0, 2095230.0)
(2095230.0, 4123105.0) (2095230.0, 5560423.0) (4123105.0, 1999991.0)
(4123105.0, 2095230.0) (4123105.0, 4123105.0) (4123105.0, 5560423.0)
(5560423.0, 1999991.0) (5560423.0, 2095230.0) (5560423.0, 4123105.0)
(5560423.0, 5560423.0)]
46.0 [(2098996.0, 2098996.0) (2098996.0, 2114971.0) (2098996.0, 4130033.0)
(2098996.0, 4355096.0) (2114971.0, 2098996.0) (2114971.0, 2114971.0)
(2114971.0, 4130033.0) (2114971.0, 4355096.0) (4130033.0, 2098996.0)
(4130033.0, 2114971.0) (4130033.0, 4130033.0) (4130033.0, 4355096.0)
(4355096.0, 2098996.0) (4355096.0, 2114971.0) (4355096.0, 4130033.0)
(4355096.0, 4355096.0)]
82.0 [(2055207.0, 2055207.0) (2055207.0, 2093996.0) (2055207.0, 4193587.0)
(2093996.0, 2055207.0) (2093996.0, 2093996.0) (2093996.0, 4193587.0)
(4193587.0, 2055207.0) (4193587.0, 2093996.0) (4193587.0, 4193587.0)]
90.0 [(2059360.0, 2059360.0) (2059360.0, 2083762.0) (2059360.0, 2648235.0)
(2059360.0, 4212177.0) (2083762.0, 2059360.0) (2083762.0, 2083762.0)
(2083762.0, 2648235.0) (2083762.0, 4212177.0) (2648235.0, 2059360.0)
(2648235.0, 2083762.0) (2648235.0, 2648235.0) (2648235.0, 4212177.0)
(4212177.0, 2059360.0) (4212177.0, 2083762.0) (4212177.0, 2648235.0)
(4212177.0, 4212177.0)]
103.0 [(1993306.0, 1993306.0)]
Upvotes: 1
Reputation: 2417
from itertools import product
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(product(x,x))
if you'd prefer that elements don't repeat, you can use
from itertools import combinations
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(combinations(x,2))
Upvotes: 1