Find most frequent pairs in a dataframe

Question

Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:

meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows

I want to find each person's top 15 co-participants. For example, I want to know which 15 people most frequently participate in meetings with Brad.

As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:

person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...

But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:

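import pandas as pd

links = []
# Read the two-column file; both IDs are kept as strings.
lic = pd.read_csv('meetings.csv', sep = ';', names = ['meeting_id', 'person_id'], dtype = {'meeting_id': str, 'person_id': str})
# One group per person; each item is a (person_id, sub-dataframe) tuple.
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    # Only process IDs that are exactly 14 characters long.
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            # Full scan of the whole dataframe for every meeting of every person.
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    # Record the ordered pair (this person, co-participant).
                    tup = (person_id, person)
                    links.append(tup)

# Write all collected pairs to disk.
df = pd.DataFrame(links)
df.to_csv('links.csv', index = False)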

Any ideas?
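
One possible direction, sketched here under assumptions rather than offered as a definitive fix: the person-to-person table can be produced by joining the dataframe to itself on meeting_id, which avoids the per-meeting scan inside the loop, and the top 15 can then be read off a grouped count. The helper name top_co_participants, the column names person_id_other and n_meetings, and the toy data are all made up for illustration; only meeting_id and person_id come from the question. The 14-character ID filter from the script is left out.

```python
import pandas as pd

# Toy data in the same shape as the table in the question.
lic = pd.DataFrame({
    'meeting_id': ['meeting0', 'meeting0', 'meeting0', 'meeting1', 'meeting1'],
    'person_id': ['person1234', 'person4321', 'person5555', 'person4321', 'person9999'],
})

def top_co_participants(df, n=15):
    """Hypothetical helper: each person's n most frequent co-participants."""
    # Drop duplicate rows so a person listed twice in a meeting counts once.
    df = df[['meeting_id', 'person_id']].drop_duplicates()
    # Self-join on meeting_id: one row per ordered pair of people in a meeting.
    pairs = df.merge(df, on='meeting_id', suffixes=('', '_other'))
    # Remove the rows that pair a person with themselves.
    pairs = pairs[pairs['person_id'] != pairs['person_id_other']]
    # Count how many meetings each ordered pair shares.
    counts = (pairs.groupby(['person_id', 'person_id_other'])
                   .size()
                   .reset_index(name='n_meetings'))
    # Highest counts first within each person; ties broken by the other ID.
    counts = counts.sort_values(
        ['person_id', 'n_meetings', 'person_id_other'],
        ascending=[True, False, True],
    )
    # Keep at most n rows per person.
    return counts.groupby('person_id').head(n).reset_index(drop=True)

top = top_co_participants(lic, n=15)
print(top)
```

The self-join produces roughly the sum of k squared rows over all meetings, where k is the number of participants in a meeting, so a few very large meetings could dominate memory use; whether this fits in RAM for the real data is not something the sample can tell us.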
