Reputation: 527
I currently have a pandas
DataFrame df
:
paper reference
2171686 p84 r51
3816503 p41 r95
4994553 p112 r3
2948201 p112 r61
2957375 p32 r41
2938471 p65 r41
...
Here, each row of df
shows the relationship of citation between paper
and reference
(where paper
cites reference
).
I need the following numbers for my analysis:
Frequency of elements of paper
in df
When two elements from paper
are randomly selected, the number of reference
they cite in common
For number 1, I performed the following:
df_count = df.groupby(['paper'])['paper'].count()
For number 2, I performed the operation that returns pairs of elements in paper
that cite the same element in reference
:
from collections import defaultdict
pair = []
d = defaultdict(list)
for idx, row in df.iterrows():
d[row['paper']].append(row['paper'])
for ref, lst in d.items():
for i in range(len(lst)):
for j in range(i+1, len(lst)):
pair.append([lst[i], lst[j], ref])
pair
is a list that consists of three elements: first two elements are the pair of paper
, and the third element is from reference
that both paper
elements cite. Below is what pair
looks like:
[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]
I would like to retrieve a DataFrame in the following format:
paper1 freq1 paper2 freq2 common
p17 4 p45 3 2
p5 2 p8 5 2
...
where paper1
and paper2
represent the first two elements of each list of pair
, freq1
and freq2
represent the frequency count of each paper done by df_count
, and common
is a number of reference
both paper1
and paper2
cite in common.
How can I retrieve my desired dataset (in the desired format) from df
, df_count
, and pair
?
Upvotes: 0
Views: 75
Reputation: 370
I think this can be solved only using pandas.DataFrame.merge. I am not sure whether this is the most efficient way, though.
First, generate common reference counts:
# Merge the dataframe with itself to generate pairs
# Note that we merge only on reference, i.e. we generate each and every pair
df_pairs = df.merge(df, on=["reference"])
# Dataframe contains duplicate pairs of form (p1, p2) and (p2, p1), remove duplicates
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]
# Now group by pairs, and count the rows
# This will give you the number of common references per each paper pair
# reset_index is necessary to get each row separately
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]
Second, generate number of references per paper (you already got this):
df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]
Third, merge the two DataFrames:
# Note that we merge twice to get the count for both papers in each pair
df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")
# Get necessary columns and rename them
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
Upvotes: 1