Lusian

Reputation: 653

Providing fields for every id through network in pandas

I am sorry for re-posting this old question of mine, but it has not been fully solved, and the older post was getting long and confusing.


The issue is the one described below. I received a very good answer that applies network theory, namely:

# generate exploded version of DataFrame to be able to construct the graph
df2 = (result2_min.loc[m]
        .explode(['citing_patents', 'dist_citing_patents'])
      )

# build the directed graph with weights
import networkx as nx

G = nx.from_pandas_edgelist(df2.rename(columns={'dist_citing_patents': 'weight'}),
                           source='docdb_family_id', target='citing_patents',
                           edge_attr='weight',
                           create_using=nx.DiGraph)

# find closest leaf for each node
def distance(n, leaf):
   try:
       return nx.dijkstra_path_length(G, n, leaf, weight='weight')
   except nx.NetworkXNoPath:
       return float('inf')

mapper = {n: min(leafs, key=lambda leaf: distance(n, leaf)) for n in G.nodes}

# map leaf to field
fields = result2_min[~m].set_index('docdb_family_id')['oecd_fields']

# map each node to terminal leafs to field
result2_min['New_var'] = result2_min['Cited_patents'].map(mapper).map(fields)

However, my data frame contains 9 million observations, and this part takes forever to run:

mapper = {n: min(leafs, key=lambda leaf: distance(n, leaf)) for n in G.nodes}

Hence, I am wondering whether there is either a less computationally expensive solution or a way to rewrite the mapper code so that it runs in linear time (O(n)).

Below is the code for generating the mock example:

import pandas as pd

# initialize list of lists: [id, cited patents, distances to cited patents, field]
data = [[1, [7,3], [1,1], ""], [2, [1,5], [2,1], "Math"], [3, [1,2,6], [2,0,2], ""],[4, [7], [1], "Science"],[5, [1,2], [2,0], ""],[6, [5,8], [1,1], ""],[7, [4,8], [0,1], ""],[8, [4], [0], ""]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['docdb', 'cited_patents','dist_cited_patents','Fields'])

Thank you

Upvotes: 1

Views: 92

Answers (1)

Laurent

Reputation: 13518

Here is a different approach to your interesting question. It is less sophisticated and probably less efficient on your real data than your current solution, but I would be interested to know how it performs anyway.

def select_patents(dist, patents):
    """Helper function.

    Args:
        dist: distances of cited patents.
        patents: cited patents.

    Returns:
        Patents with minimum distance.

    """
    # Compute the minimum once instead of once per element
    min_dist = min(dist)
    return [p for d, p in zip(dist, patents) if d == min_dist]


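# Assumption: the mock DataFrame's "docdb" and "Fields" columns are renamed
# to match the names used below and in the printed output
df = df.rename(columns={"docdb": "docdb_id", "Fields": "fields"})

# Find patents
df["selected_patents"] = df.apply(
    lambda x: select_patents(x["dist_cited_patents"], x["cited_patents"]), axis=1
)
df["new_var"] = df["fields"]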

# Follow selected patents
while not df["new_var"].all():
    df["new_var"] = df.apply(
        lambda x: [
            df.loc[df["docdb_id"] == i, "new_var"].values[0]
            if not x["new_var"]
            else x["new_var"]
            for i in x["selected_patents"]
        ],
        axis=1,
    )
    df["new_var"] = df.apply(
        lambda x: ""
        if (isinstance(x["new_var"], list) and not all(x["new_var"]))
        or not x["new_var"]
        else x["new_var"],
        axis=1,
    )

# Cleanup
df["new_var"] = df.apply(lambda x: [item[0] for item in x["new_var"]], axis=1)
df["new_var"] = df["new_var"].apply(lambda x: x[0] if x and len(x) == 1 else x)
print(df)
   docdb_id cited_patents dist_cited_patents   fields          new_var
0         1        [7, 3]             [1, 1]           [Science, Math]
1         2        [1, 5]             [2, 1]     Math             Math
2         3     [1, 2, 6]          [2, 0, 2]                      Math
3         4           [7]                [1]  Science          Science
4         5        [1, 2]             [2, 0]                      Math
5         6        [5, 8]             [1, 1]           [Math, Science]
6         7        [4, 8]             [0, 1]                   Science
7         8           [4]                [0]                   Science
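
On the performance issue raised in the question: the dictionary comprehension runs one Dijkstra search per (node, leaf) pair, which is likely why it does not scale to 9 million rows. One possible alternative, sketched below on the question's mock data, is to reverse every edge and run a single multi-source Dijkstra search starting from all leaves at once with `nx.multi_source_dijkstra`. This is only a sketch under assumptions: it treats "leaves" as the patents whose `Fields` value is non-empty, it builds edges only from rows without a field (mirroring `result2_min.loc[m]`), and the names `R`, `has_field` and `unknown` are illustrative. When two leaves are equally close, it keeps just one of them, chosen arbitrarily.

```python
import networkx as nx
import pandas as pd

# Mock data from the question
data = [
    [1, [7, 3], [1, 1], ""],
    [2, [1, 5], [2, 1], "Math"],
    [3, [1, 2, 6], [2, 0, 2], ""],
    [4, [7], [1], "Science"],
    [5, [1, 2], [2, 0], ""],
    [6, [5, 8], [1, 1], ""],
    [7, [4, 8], [0, 1], ""],
    [8, [4], [0], ""],
]
df = pd.DataFrame(
    data, columns=["docdb", "cited_patents", "dist_cited_patents", "Fields"]
)

# Assumption: "leaves" are the patents whose field is already known
has_field = df["Fields"].ne("")
leaves = df.loc[has_field, "docdb"].tolist()

# Build the graph with every edge reversed (cited -> citing), using only rows
# without a field. Distances *from* the leaves in this graph then equal
# distances *to* the leaves in the original graph.
R = nx.DiGraph()
R.add_nodes_from(leaves)  # make sure every source node exists in the graph
unknown = df[~has_field]
for src, targets, dists in zip(
    unknown["docdb"], unknown["cited_patents"], unknown["dist_cited_patents"]
):
    for tgt, d in zip(targets, dists):
        R.add_edge(tgt, src, weight=d)

# One Dijkstra run started from all leaves at once
dist, paths = nx.multi_source_dijkstra(R, set(leaves), weight="weight")

# Each shortest path starts at the leaf it originated from
mapper = {node: path[0] for node, path in paths.items()}

# Map each patent to its nearest leaf, then to that leaf's field;
# patents that cannot reach any leaf end up as NaN
fields = df.loc[has_field].set_index("docdb")["Fields"]
df["New_var"] = df["docdb"].map(mapper).map(fields)
print(df[["docdb", "Fields", "New_var"]])
```

With a binary heap, a single run costs roughly O((V + E) log V) rather than one search per node and leaf, so it is not strictly linear but should be far cheaper. If all equally close leaves are needed, as in the list-valued rows above, the tie handling would have to be added separately.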

Upvotes: 1
