rafaela medeiros

Reputation: 11

Why is the CausalNex output in Python wrong?

I am using CausalNex to create a DAG from a dataset in Python.

I got the graph, and the nodes are correct, but the edges are completely wrong. I tested it on a DataFrame with four random, independent variables (Requestor, Risk, Size, Developer) and a single dependent one (Duration). This is the graph it produced:

DAG using CausalNex

Am I using the library incorrectly? Why is the figure so distant from the true data-generating process? Could a Bayesian Network model outperform CausalNex?

I tried this code:

# Generate initial data

import numpy as np
import pandas as pd

np.random.seed(42)
fib_list = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

df = pd.DataFrame({
    "Requestor": np.random.randint(1, 4, 100),
    "Size": np.random.randint(1, 4, 100),
    "Risk": np.random.randint(1, 4, 100)
})

df['Developer'] = np.random.choice(fib_list, df.shape[0])
df["Duration"] = (
    0.1 * df["Requestor"] +
    0.2 * df["Size"] +
    0.2 * df["Risk"] +
    0.5 * df["Developer"]
)

# Generate graph

from causalnex.structure.notears import from_pandas
import matplotlib.pyplot as plt
import networkx as nx

sm = from_pandas(df)
sm.remove_edges_below_threshold(0.8)
nx.draw_shell(sm, with_labels=True, font_weight="bold")
plt.show()

I was expecting something like this:

Expected Output

Upvotes: 0

Views: 192

Answers (1)

Pierre-Henri Wuillemin

Reputation: 541

I would say that the relations between the variables are not easy to capture (particularly because of the domain size of Developer). The parents of the "continuous" Duration have a joint domain of 3*3*3*11 = 297 configurations (np.random.randint(1, 4, N) only yields 1, 2 and 3, and fib_list has 11 distinct values). And Duration itself is not really continuous: it can take only 102 different values.

So a database of 100 rows is really not enough for the tests/scores used by the learning algorithms to be accurate.
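
As a rough check on these domain sizes, the sketch below enumerates every parent configuration of Duration and counts the distinct values Duration can take. It assumes the data-generating code from the question, with Duration multiplied by 10 so that it stays an integer. Note that np.random.randint(1, 4, N) excludes the upper bound, so Requestor, Size and Risk each take 3 values, and fib_list contains 11 distinct values. The sampling helper (observed_configs) is hypothetical and uses Python's random module rather than NumPy, so its coverage numbers are only illustrative.

```python
import random
from itertools import product

FIB_LIST = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

levels = range(1, 4)                 # values 1, 2, 3 (upper bound excluded)
developers = sorted(set(FIB_LIST))   # 11 distinct values, since 1 appears twice

# Every joint configuration of the four parents of Duration.
parent_configs = list(product(levels, levels, levels, developers))

# Duration * 10 = 1*Requestor + 2*Size + 2*Risk + 5*Developer
durations = {r + 2 * s + 2 * k + 5 * d for r, s, k, d in parent_configs}

n_configs = len(parent_configs)
n_durations = len(durations)
print(f"parent configurations:    {n_configs}")
print(f"distinct Duration values: {n_durations}")


def observed_configs(n_rows: int, seed: int = 42) -> int:
    """Count how many distinct parent configurations show up in n_rows samples."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_rows):
        seen.add((rng.randint(1, 3),       # Requestor
                  rng.randint(1, 3),       # Size
                  rng.randint(1, 3),       # Risk
                  rng.choice(FIB_LIST)))   # Developer (1 is twice as likely)
    return len(seen)


coverage = {n: observed_configs(n) for n in (100, 5000, 55000)}
for n, seen in coverage.items():
    print(f"N={n}: {seen} of {n_configs} configurations observed")
```

With 100 rows, at most 100 of the 297 parent configurations can appear at all, so most of them are never observed and the learner has nothing to estimate them from.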

[image]

Note that I multiplied Duration by 10 to keep integer values.

FYI, here is an inference in the last BN:

[image]

The code:


import numpy as np 
import pandas as pd 

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
gum.config["notebook", "default_graph_size"] = "3!"  # change the default display size for BNs

def createDB(N:int):
    # code from Rafaela Medeiros
    np.random.seed(42) 
    fib_list = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89] 
    data = {"Requestor": np.random.randint(1,4,N), 
            "Size": np.random.randint(1,4,N), 
            "Risk": np.random.randint(1,4,N)} 

    df = pd.DataFrame(data)
    df['Developer'] = np.random.choice(fib_list, df.shape[0]) 
    df["Duration"] = (1*df["Requestor"] + 2*df["Size"] + 2*df["Risk"] + 5*df["Developer"])
    
    return df

def learnForSize(N:int):
    learner=gum.BNLearner(createDB(N))
    learner.useMIIC() # choosing this algo to learn
    bn=learner.learnBN()
    return bn

sizes=[100,5000,55000]
gnb.flow.row(*[learnForSize(N) for N in sizes],
             captions=[f"{size=}" for size in sizes])

Upvotes: 1
