Reputation: 147
I have a data frame of patent numbers and the inventors who invented those patents. For example:
| patent_number | inventor_id |
|---|---|
| 1 | A |
| 1 | B |
| 2 | B |
| 2 | C |
| 3 | A |
| 3 | B |
I define a team as a group of inventors who produce a patent together. E.g. the team (A,B) produced patent 1, (B,C) patent 2 and again (A,B) produced patent 3. I want to count the number of unique teams. In this case the answer is 2.
What is the fastest way of counting the number of unique teams using python?
I have written this code, but it is very slow when I run it on my entire data set which includes over 6 million patent numbers and 3.5 million unique inventor ids.
teams = []
for pat_id, pat_df in inventor_data.groupby("patent_number"):
    # linear scan over every previously seen team, for every patent
    if list(pat_df["inventor_id"]) not in teams:
        teams.append(list(pat_df["inventor_id"]))
print("Number of teams ", len(teams))
I am looking for speed improvements. If you can also explain why your approach is faster, I am always keen to learn.
Thank you!
Upvotes: 1
Views: 80
Reputation: 260640
You can groupby, aggregate each group as a frozenset, and count the unique values:
df.groupby('patent_number')['inventor_id'].agg(frozenset).nunique()
Output: 2
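For reference, here is a minimal self-contained sketch of this approach; the DataFrame df is built here purely to reproduce the question's example:

import pandas as pd

# example data from the question
df = pd.DataFrame({
    "patent_number": [1, 1, 2, 2, 3, 3],
    "inventor_id": ["A", "B", "B", "C", "A", "B"],
})

# frozenset is hashable and order-insensitive, so (A, B) and (B, A)
# are treated as the same team
n_teams = df.groupby("patent_number")["inventor_id"].agg(frozenset).nunique()
print(n_teams)  # 2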
Interestingly, you can also easily get the number of occurrences of each team with value_counts:
df.groupby('patent_number')['inventor_id'].agg(frozenset).value_counts()
Output:
(B, A) 2
(B, C) 1
Name: inventor_id, dtype: int64
Upvotes: 4
Reputation: 86
You could go for:
inventor_data = inventor_data.sort_values("inventor_id")
inventor_data.groupby("patent_number").inventor_id.sum().nunique()
A few explanations:
- sort_values("inventor_id") puts the inventors in a deterministic order, so the same team always yields the same concatenation no matter how the rows happen to be ordered;
- groupby("patent_number").inventor_id.sum() concatenates the (string) inventor ids of each patent into a single string;
- nunique() then counts the distinct concatenated strings, i.e. the distinct teams.
A short illustration on the example data follows below.
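A minimal sketch of this approach on the question's example, assuming inventor_id holds strings (summing a string column concatenates the values):

import pandas as pd

inventor_data = pd.DataFrame({
    "patent_number": [1, 1, 2, 2, 3, 3],
    "inventor_id": ["A", "B", "B", "C", "A", "B"],
})

# sort first so the per-patent concatenation is order-independent
inventor_data = inventor_data.sort_values("inventor_id")
teams = inventor_data.groupby("patent_number").inventor_id.sum()
print(teams.tolist())   # ['AB', 'BC', 'AB']
print(teams.nunique())  # 2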
Upvotes: 1