Sara_Hsu
Sara_Hsu

Reputation: 57

How to get the cartesian product of a pandas dataframe under certain condition

Given a dataframe:

    qid cid title
0    1   a   croc 
1    2   b   dog    
2    3   a   fish
3    4   b   cat
4    5   a   bird

I want to get a new dataframe that is the cartesian product of each row with each other row which has the same cid value as it (that is, to get all the pairs of rows with the same cid):

    cid1 cid2 qid1 title1 qid2 title2
0    a     a    1   croc    3    fish
1    a     a    1   croc    5    bird
2    a     a    3   fish    5    bird
3    b     b    2   dog     4    cat

Suppose my dataset is about 500M, can anybody solve this problem in a comparatively efficient way?

Upvotes: 2

Views: 612

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

One way to do it s to use a self merge then filter out all the unwanted records.

df.merge(df, on='cid', suffixes=('1','2')).query('qid1 < qid2')

Output:

    qid1 cid title1  qid2 title2
1      1   a   croc     3   fish
2      1   a   croc     5   bird
5      3   a   fish     5   bird
10     2   b    dog     4    cat

Upvotes: 2

Related Questions