Reputation: 11
I am using CausalNex to create a DAG from a dataset in Python.
I got the graph, and the nodes are correct, but the edges are totally off. I tried this in a DataFrame with four random independent variables (Requestor, Risk, Size, Developer) and a single dependent one (Duration), and the graph produced is this:
Am I using the library incorrectly? Why is the figure so distant from the true data-generating process? Could a Bayesian Network model outperform CausalNex?
I tried this code:
# Generate initial data
import numpy as np
import pandas as pd
np.random.seed(42)
fib_list = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
df = pd.DataFrame({
    "Requestor": np.random.randint(1, 4, 100),
    "Size": np.random.randint(1, 4, 100),
    "Risk": np.random.randint(1, 4, 100)
})
df["Developer"] = np.random.choice(fib_list, df.shape[0])
df["Duration"] = (
    0.1 * df["Requestor"] +
    0.2 * df["Size"] +
    0.2 * df["Risk"] +
    0.5 * df["Developer"]
)
# Generate graph
from causalnex.structure.notears import from_pandas
import matplotlib.pyplot as plt
import networkx as nx
sm = from_pandas(df)
sm.remove_edges_below_threshold(0.8)
nx.draw_shell(sm, with_labels=True, font_weight="bold")
plt.show()
I was expecting something like this:
Upvotes: 0
Views: 192
Reputation: 541
I would say that the relations between the variables are not easy to capture, particularly because of the domain size of Developer. The parents of "Duration" have a joint domain of 3*3*3*11 = 297 configurations (np.random.randint(1, 4, N) only yields 1, 2 or 3, and fib_list holds 11 distinct values), and Duration itself is not really continuous: it can take only 102 different values.
So a database of size 100 is really not enough for the tests/scores used by the learning algorithms to be accurate.
Note that I multiplied Duration by 10 to keep integer values.
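To make the sample-size argument concrete, here is a small sketch that enumerates the domains instead of sampling them. It is only an illustration, not part of the learning code: the names small_domain, developer_domain, parent_configs and durations are mine, and it assumes the integer-scaled Duration (1*Requestor + 2*Size + 2*Risk + 5*Developer).

```python
from itertools import product

fib_list = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

# np.random.randint(1, 4, N) excludes the upper bound, so it yields 1, 2 or 3
small_domain = [1, 2, 3]
# The value 1 appears twice in fib_list, so count distinct values only
developer_domain = sorted(set(fib_list))

# Every possible (Requestor, Size, Risk, Developer) combination
parent_configs = list(
    product(small_domain, small_domain, small_domain, developer_domain)
)

# Duration multiplied by 10 so that it stays an integer (illustrative assumption)
durations = {r + 2 * s + 2 * k + 5 * d for r, s, k, d in parent_configs}

n_rows = 100
print(len(parent_configs))           # joint parent configurations
print(len(durations))                # distinct Duration values
print(n_rows / len(parent_configs))  # average rows available per configuration
```

With 100 rows there is on average far less than one observation per parent configuration, which is why scores and independence tests computed on such a sample are unreliable.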
FYI, an inference in the last BN:
The code:
import numpy as np
import pandas as pd
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
gum.config["notebook","default_graph_size"]="3!" #change default size for BN
def createDB(N: int):
    # code from Rafaela Medeiros
    np.random.seed(42)
    fib_list = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    data = {"Requestor": np.random.randint(1, 4, N),
            "Size": np.random.randint(1, 4, N),
            "Risk": np.random.randint(1, 4, N)}
    df = pd.DataFrame(data)
    df["Developer"] = np.random.choice(fib_list, df.shape[0])
    df["Duration"] = (1*df["Requestor"] + 2*df["Size"] + 2*df["Risk"] + 5*df["Developer"])
    return df

def learnForSize(N: int):
    learner = gum.BNLearner(createDB(N))
    learner.useMIIC()  # choosing this algo to learn
    bn = learner.learnBN()
    return bn

sizes = [100, 5000, 55000]
gnb.flow.row(*[learnForSize(N) for N in sizes],
             captions=[f"{size=}" for size in sizes])
Upvotes: 1