Reputation: 57
Given a dataframe:
qid cid title
0 1 a croc
1 2 b dog
2 3 a fish
3 4 b cat
4 5 a bird
I want to get a new dataframe that is the cartesian product of each row with each other row which has the same cid value as it (that is, to get all the pairs of rows with the same cid):
cid1 cid2 qid1 title1 qid2 title2
0 a a 1 croc 3 fish
1 a a 1 croc 5 bird
2 a a 3 fish 5 bird
3 b b 2 dog 4 cat
Suppose my dataset is about 500M, can anybody solve this problem in a comparatively efficient way?
Upvotes: 2
Views: 612
Reputation: 153460
One way to do it s to use a self merge
then filter out all the unwanted records.
df.merge(df, on='cid', suffixes=('1','2')).query('qid1 < qid2')
Output:
qid1 cid title1 qid2 title2
1 1 a croc 3 fish
2 1 a croc 5 bird
5 3 a fish 5 bird
10 2 b dog 4 cat
Upvotes: 2