Python pandas dataframe and list merging

Question

I currently have a pandas DataFrame df:

         paper     reference
2171686  p84       r51
3816503  p41       r95
4994553  p112      r3
2948201  p112      r61
2957375  p32       r41
2938471  p65       r41
...

Here, each row of df shows the relationship of citation between paper and reference (where paper cites reference).

I need the following numbers for my analysis:

Frequency of elements of paper in df
When two elements from paper are randomly selected, the number of reference they cite in common

For number 1, I performed the following:

df_count = df.groupby(['paper'])['paper'].count()

For number 2, I performed the operation that returns pairs of elements in paper that cite the same element in reference:

from collections import defaultdict

pair = []
d = defaultdict(list)
for idx, row in df.iterrows():
    d[row['paper']].append(row['paper'])
for ref, lst in d.items():
    for i in range(len(lst)):
        for j in range(i+1, len(lst)):
            pair.append([lst[i], lst[j], ref])

pair is a list that consists of three elements: first two elements are the pair of paper, and the third element is from reference that both paper elements cite. Below is what pair looks like:

[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]

I would like to retrieve a DataFrame in the following format:

paper1      freq1       paper2       freq2        common
p17         4           p45          3            2
p5          2           p8           5            2
...

where paper1 and paper2 represent the first two elements of each list of pair, freq1 and freq2 represent the frequency count of each paper done by df_count, and common is a number of reference both paper1 and paper2 cite in common.

How can I retrieve my desired dataset (in the desired format) from df, df_count, and pair?

Alp Aribal · Accepted Answer

I think this can be solved only using pandas.DataFrame.merge. I am not sure whether this is the most efficient way, though.

First, generate common reference counts:

# Merge the dataframe with itself to generate pairs
# Note that we merge only on reference, i.e. we generate each and every pair
df_pairs = df.merge(df, on=["reference"])

# Dataframe contains duplicate pairs of form (p1, p2) and (p2, p1), remove duplicates
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]

# Now group by pairs, and count the rows
# This will give you the number of common references per each paper pair
# reset_index is necessary to get each row separately
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]

Second, generate number of references per paper (you already got this):

df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]

Third, merge the two DataFrames:

# Note that we merge twice to get the count for both papers in each pair
df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")

# Get necessary columns and rename them
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]

Python pandas dataframe and list merging

Answers (1)

Related Questions