Providing fields for every id through network in pandas

Question

I apologize for re-posting this old question of mine, but it was never fully solved and the original post had become long and confusing.

The issue is the one described below. I received a very good answer that applies network theory, namely:

# generate exploded version of DataFrame to be able to construct the graph
df2 = (result2_min.loc[m]
        .explode(['citing_patents', 'dist_citing_patents'])
      )

# build the directed graph with weights
import networkx as nx

G = nx.from_pandas_edgelist(df2.rename(columns={'dist_citing_patents': 'weight'}),
                           source='docdb_family_id', target='citing_patents',
                           edge_attr='weight',
                           create_using=nx.DiGraph)

# find closest leaf for each node
def distance(n, leaf):
   try:
       return nx.dijkstra_path_length(G, n, leaf, weight='weight')
   except nx.NetworkXNoPath:
       return float('inf')

mapper = {n: min(leafs, key=lambda leaf: distance(n, leaf)) for n in G.nodes}

# map leaf to field
fields = result2_min[~m].set_index('docdb_family_id')['oecd_fields']

# map each node to terminal leafs to field
result2_min['New_var'] = result2_min['Cited_patents'].map(mapper).map(fields)

However, my DataFrame contains 9 million observations, and this part takes forever to run:

mapper = {n: min(leafs, key=lambda leaf: distance(n, leaf)) for n in G.nodes}

Hence I am wondering whether there is either a less computationally expensive solution or a way to rewrite the mapper code so that it runs in linear time, O(n).

Below is the code that generates the mock example:

import pandas as pd

# initialize list of lists: [id, cited patents, distances to cited patents, field]
data = [[1, [7, 3], [1, 1], ""], [2, [1, 5], [2, 1], "Math"], [3, [1, 2, 6], [2, 0, 2], ""], [4, [7], [1], "Science"],
        [5, [1, 2], [2, 0], ""], [6, [5, 8], [1, 1], ""], [7, [4, 8], [0, 1], ""], [8, [4], [0], ""]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['docdb', 'cited_patents', 'dist_cited_patents', 'Fields'])

Thank you

Laurent · Accepted Answer

Here is a different approach to your interesting question. It is less sophisticated, and probably less efficient on your real data, than your current solution, but I would be interested to know how it performs anyway.

def select_patents(dist, patents):
    """Helper function.

    Args:
        dist: distances of cited patents.
        patents: cited patents.

    Returns:
        Patents with minimum distance.

    """
    min_dist = min(dist)  # compute the minimum once instead of once per element
    return [p for d, p in zip(dist, patents) if d == min_dist]


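# Align the mock DataFrame's column names with the ones used below
df = df.rename(columns={"docdb": "docdb_id", "Fields": "fields"})

# Find, for each row, the cited patents with the minimum distance
df["selected_patents"] = df.apply(
    lambda x: select_patents(x["dist_cited_patents"], x["cited_patents"]), axis=1
)
df["new_var"] = df["fields"]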

# Follow selected patents
while not df["new_var"].all():
    df["new_var"] = df.apply(
        lambda x: [
            df.loc[df["docdb_id"] == i, "new_var"].values[0]
            if not x["new_var"]
            else x["new_var"]
            for i in x["selected_patents"]
        ],
        axis=1,
    )
    df["new_var"] = df.apply(
        lambda x: ""
        if (isinstance(x["new_var"], list) and not all(x["new_var"]))
        or not x["new_var"]
        else x["new_var"],
        axis=1,
    )

# Cleanup
df["new_var"] = df.apply(lambda x: [item[0] for item in x["new_var"]], axis=1)
df["new_var"] = df["new_var"].apply(lambda x: x[0] if x and len(x) == 1 else x)

print(df)
   docdb_id cited_patents dist_cited_patents   fields          new_var
0         1        [7, 3]             [1, 1]           [Science, Math]
1         2        [1, 5]             [2, 1]     Math             Math
2         3     [1, 2, 6]          [2, 0, 2]                      Math
3         4           [7]                [1]  Science          Science
4         5        [1, 2]             [2, 0]                      Math
5         6        [5, 8]             [1, 1]           [Math, Science]
6         7        [4, 8]             [0, 1]                   Science
7         8           [4]                [0]                   Science
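
As a side note on the bottleneck described in the question: the mapper computes one shortest path per (node, leaf) pair, which is where the cost explodes. One possible alternative, sketched below and not part of the approach above, is a single multi-source Dijkstra run on the reversed graph, starting from all leaves at once, so that each node is labelled with its nearest leaf in one pass. This is a minimal pure-Python sketch on the mock data; it assumes that "leaves" are the rows with a non-empty field, and the names `reverse_adj`, `nearest_leaf`, and `origin` are illustrative, not from the original code. The running time is roughly O((V + E) log V) rather than strictly O(n), and ties between equally distant leaves are broken arbitrarily (here by the smaller leaf id) instead of returning both fields.

```python
import heapq

# Mock data from the question: (id, cited ids, distances, field)
data = [
    (1, [7, 3], [1, 1], ""),
    (2, [1, 5], [2, 1], "Math"),
    (3, [1, 2, 6], [2, 0, 2], ""),
    (4, [7], [1], "Science"),
    (5, [1, 2], [2, 0], ""),
    (6, [5, 8], [1, 1], ""),
    (7, [4, 8], [0, 1], ""),
    (8, [4], [0], ""),
]

# Reversed adjacency list: an edge "id -> cited" is stored as "cited -> id",
# so that a search started at the leaves reaches the nodes that lead to them.
reverse_adj = {}
for node, cited, dists, _ in data:
    for target, weight in zip(cited, dists):
        reverse_adj.setdefault(target, []).append((node, weight))

# Assumption: a leaf is any row that already has a field
fields = {node: field for node, _, _, field in data if field}


def nearest_leaf(reverse_adj, leaves):
    """Multi-source Dijkstra: return (distance, origin leaf) for each reachable node."""
    dist, origin = {}, {}
    heap = [(0, leaf, leaf) for leaf in leaves]
    heapq.heapify(heap)
    while heap:
        d, node, leaf = heapq.heappop(heap)
        if node in dist:
            continue  # already settled with a shorter (or equal) distance
        dist[node] = d
        origin[node] = leaf
        for neighbour, weight in reverse_adj.get(node, []):
            if neighbour not in dist:
                heapq.heappush(heap, (d + weight, neighbour, leaf))
    return dist, origin


dist, origin = nearest_leaf(reverse_adj, fields)
new_var = {node: fields[origin[node]] for node in sorted(origin)}
print(new_var)
```

On the real data, the `new_var` dictionary could then be mapped onto the id column with `Series.map`. Note that this uses total path length, whereas the loop above follows only the minimum-distance citations at each step, so the two can disagree on larger graphs. networkx also provides `nx.multi_source_dijkstra`, which might serve the same purpose when applied to the reversed graph; that variant is untested here.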
