Reputation: 412
Suppose that each product has different versions that change over time, and I have a data set of observations over time with the product id, version id and other data.
I am interested in the Cartesian product of the indices of successive versions, i.e. the Cartesian products of the indices of version_1 and version_2, version_2 and version_3, and version_3 and version_4.
For example, the Cartesian product of version_1 and version_2 is (0,3), (1,3), (2,3), (0,4), (1,4), (2,4); that of version_2 and version_3 is (3,5), (3,6), (3,7), (4,5), (4,6), (4,7), etc. Ideally I would like two arrays: one of the left indices and one of the right.
Any hints as to how this can be done efficiently using numpy, rather than by manually looping, which is very slow?
Upvotes: 1
Views: 151
Reputation: 412
The best way I found is to get the version order for each product manually, loop through the successive versions, and then get the indices of the Cartesian product for each pair (the code below additionally restricts pairs to observations from the same country).
import numpy as np

# Assumes product_ids, product_versions and countries are 1-D arrays of equal length.

def cartesian_product(x: np.ndarray, y: np.ndarray):
    # Every element of x paired with every element of y (x varies fastest).
    return np.tile(x, len(y)), np.repeat(y, len(x))

unique_product_ids = np.unique(product_ids)
unique_countries = np.unique(countries)
indices_left_list = []
indices_right_list = []
for product_id in unique_product_ids:
    current_product_versions = product_versions[product_ids == product_id]
    # np.unique sorts its output, so recover the order of first appearance.
    _, indexes = np.unique(current_product_versions, return_index=True)
    unique_versions_in_order = [current_product_versions[index] for index in sorted(indexes)]
    for country in unique_countries:
        # Walk over pairs of successive versions: (v1, v2), (v2, v3), ...
        for version_left, version_right in zip(unique_versions_in_order, unique_versions_in_order[1:]):
            indices_left, indices_right = cartesian_product(
                np.flatnonzero((countries == country) & (product_ids == product_id) & (product_versions == version_left)),
                np.flatnonzero((countries == country) & (product_ids == product_id) & (product_versions == version_right))
            )
            indices_left_list.append(indices_left)
            indices_right_list.append(indices_right)
indices_left = np.concatenate(indices_left_list)
indices_right = np.concatenate(indices_right_list)
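For a single product (and ignoring countries), the inner loops can in principle be replaced by index arithmetic. The following is only a sketch under that assumption, not a drop-in replacement for the code above: the function name `successive_version_pairs` is made up for illustration, and it expects a 1-D array of version labels for one product. It numbers the versions by order of first appearance, then builds all pairs for every successive pair of versions in one pass.

```python
import numpy as np

def successive_version_pairs(versions):
    """Hypothetical helper: left/right row indices for successive versions."""
    versions = np.asarray(versions)
    # Integer code per version, numbered by order of first appearance.
    _, first_idx, inverse = np.unique(versions, return_index=True, return_inverse=True)
    rank = np.empty(len(first_idx), dtype=np.intp)
    rank[np.argsort(first_idx)] = np.arange(len(first_idx))
    codes = rank[inverse]

    # Row indices grouped by version, plus each group's size and start offset.
    order = np.argsort(codes, kind="stable")
    counts = np.bincount(codes)
    starts = np.cumsum(counts) - counts

    # Block g holds the counts[g] * counts[g + 1] pairs of versions g and g + 1.
    sizes = counts[:-1] * counts[1:]
    block = np.repeat(np.arange(len(sizes)), sizes)
    within = np.arange(sizes.sum()) - np.repeat(np.cumsum(sizes) - sizes, sizes)

    # Inside a block the left index varies fastest, as with np.tile / np.repeat.
    left = order[starts[block] + within % counts[block]]
    right = order[starts[block + 1] + within // counts[block]]
    return left, right

example = ["version_1"] * 3 + ["version_2"] * 2 + ["version_3"] * 3 + ["version_4"]
indices_left, indices_right = successive_version_pairs(example)
```

With several products or countries, one option would be to call this once per group and offset the results by the group's row positions; whether that beats the loop above depends on how many groups there are.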
Upvotes: 0
Reputation: 4588
You can try this:
import pandas as pd
import itertools
df = pd.DataFrame({'version': ['version_1', 'version_1', 'version_1', 'version_2', 'version_2', 'version_3', 'version_3', 'version_3', 'version_4']})
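# Take the numeric suffix as an int so that e.g. version_10 sorts after version_9.
df['version'] = df['version'].str.rsplit('_', n=1).str[-1].astype(int)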
df = df.reset_index().groupby('version')['index'].apply(list).rename('versions').reset_index()
df['versions_shift'] = df['versions'].shift(-1, fill_value=[[]])
df['cartesian'] = df.apply(lambda x: itertools.product(x['versions'], x['versions_shift']), axis=1)
df['cartesian'] = df['cartesian'].apply(lambda x: list(zip(*x)))
df.drop(['version', 'versions', 'versions_shift'], axis=1, inplace=True)
print(df)
Output:
                                  cartesian
0  [(0, 0, 1, 1, 2, 2), (3, 4, 3, 4, 3, 4)]
1  [(3, 3, 3, 4, 4, 4), (5, 6, 7, 5, 6, 7)]
2                    [(5, 6, 7), (8, 8, 8)]
3                                        []
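The question asks for two flat arrays rather than one row per version pair. A minimal sketch of that last step, assuming the `cartesian` column holds exactly the values printed above (they are hard-coded here so the snippet runs on its own):

```python
import numpy as np

# Values copied from the printed `cartesian` column; the last row is empty
# because the final version has no successor.
cartesian = [
    [(0, 0, 1, 1, 2, 2), (3, 4, 3, 4, 3, 4)],
    [(3, 3, 3, 4, 4, 4), (5, 6, 7, 5, 6, 7)],
    [(5, 6, 7), (8, 8, 8)],
    [],
]

# Skip empty rows, then join the left and right tuples separately.
nonempty = [row for row in cartesian if row]
indices_left = np.concatenate([row[0] for row in nonempty])
indices_right = np.concatenate([row[1] for row in nonempty])
```

With the DataFrame from the answer, `cartesian` would presumably be `df['cartesian'].tolist()`. Note that `itertools.product` orders the pairs with the right index varying fastest, so the pairs are the same as in the question but listed in a different order.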
Upvotes: 1