ajay1720

Reputation: 11

Improving prediction accuracy in Bayesian Causal Network

I would like to determine the causes of an unexpected outcome (or anomaly) in a thermodynamic process. I have continuous data for the associated variables and am trying to use a Bayesian Network (BN) to determine the causal relationships. For this purpose, I used a Python library called CausalNex.

I have followed the tutorial section of this library to build the DAG and the BN model, and everything works fine up to the prediction step. The prediction accuracy for the minority/less-frequent classes is around 60-70% (80-90% with SMOTE/SMOTETomek and a particular random state), whereas a stable accuracy of more than 90% is expected. I have implemented the following data-preprocessing steps (a short sketch follows the list):

  1. Ensuring no missing/NaN values
  2. Discretization (the only data format the library supports)
  3. SMOTE/SMOTETomek for data balancing
  4. Various train/test size combinations
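
For reference, a minimal sketch of steps 1, 3 and 4; the DataFrame df, the target column 'anomaly', and the split sizes are placeholders:

    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE
    from imblearn.combine import SMOTETomek

    # df is the discretised dataset; "anomaly" is a placeholder target column
    X = df.drop(columns=["anomaly"])
    y = df["anomaly"]

    # Step 1: ensure no missing/NaN values remain
    assert not df.isna().any().any(), "dataset still contains NaNs"

    # Step 3: balance the classes with SMOTE (or SMOTETomek)
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    # X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)

    # Step 4: vary the train/test split sizes
    X_train, X_test, y_train, y_test = train_test_split(
        X_res, y_res, test_size=0.2, stratify=y_res, random_state=42
    )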

I am struggling to figure out ways to optimize the model, and I could not find any supporting material on the Internet for this.

Are there any guidelines or 'best practices' for data pre-processing techniques and dataset requirements that work particularly well for this library/BN models? Could you please suggest troubleshooting methods to identify the causes of the low accuracy/metrics? Perhaps a misspecified node-to-node causal relationship in the DAG causes the mediocre accuracy?

Any ideas/literature/other suitable library regarding this would be of great help!

Upvotes: 1

Views: 453

Answers (1)

Gabriel Azevedo

Reputation: 21

A few tips that can help:

  1. Changing/tuning the structure learning.
  • Try different thresholds. When calling from_pandas, you can experiment with different w_threshold values (and with the beta term if you are using from_pandas_lasso).

    This changes the density of the network. A denser structure means a BN with more parameters, and with more parameters your model may perform better. If the graph is too dense, though, you may not have enough data to train it and may overfit.

  • Center the data. Empirically, it seems that NOTEARS (the algorithm behind from_pandas) works best if the data is centered, so subtracting the mean of the data before running structure learning may be a good idea.

  • Ensure causality. NOTEARS does not ensure causality, so you need "experts" to judge the output and make the necessary modifications. If you see edges that don't make causal sense, you can either remove them or add them as tabu_edges and train your network again. A sketch of these adjustments follows.
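
A minimal sketch of these adjustments, assuming a pandas DataFrame df of training data; the column names and threshold values are placeholders for your thermodynamic variables:

    from causalnex.structure.notears import from_pandas

    # Centre the data first; empirically NOTEARS behaves better on
    # centred data
    df_centred = df - df.mean()

    # Prune weak edges with w_threshold and forbid edges that contradict
    # domain knowledge via tabu_edges (both settings are placeholders)
    sm = from_pandas(
        df_centred,
        w_threshold=0.8,
        tabu_edges=[("outlet_temp", "inlet_temp")],  # hypothetical forbidden edge
    )

    # Expert corrections on the learned structure
    sm.remove_edge("pressure", "ambient_temp")  # hypothetical non-causal edge
    sm.add_edge("inlet_temp", "outlet_temp")    # hypothetical known mechanism
    sm = sm.get_largest_subgraph()

After such corrections, refit the BayesianNetwork (node states and CPDs) on the modified structure before predicting again.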

  2. Experiment with discretisation. The performance can be very sensitive to how you discretise the data, and experimenting with various types of discretisation can help (a short sketch follows these options). You can use:
  • Methods available in Causalnex (uniform, for example)
  • fixed discretisations based on what thresholds make sense for your data
  • MDLP is a supervised way to discretise data. You can apply MDLP to each node, using one of its children as the "target". There are two main packages for MDLP on PyPI: mdlp and mdlp-discretization
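
A short sketch of these options; the column names, split points, and the mdlp-discretization usage are assumptions on my side:

    from causalnex.discretiser import Discretiser

    # Uniform-width buckets via CausalNex
    df["inlet_temp"] = Discretiser(
        method="uniform", num_buckets=5
    ).transform(df["inlet_temp"].values)

    # Fixed split points chosen from domain knowledge
    df["pressure"] = Discretiser(
        method="fixed", numeric_split_points=[1.0, 2.5]
    ).transform(df["pressure"].values)

    # Supervised discretisation with the mdlp-discretization package,
    # using a child node (here the anomaly label) as the "target"
    from mdlp.discretization import MDLP
    y = df["anomaly"].astype("category").cat.codes  # MDLP needs integer labels
    df["flow_rate"] = MDLP().fit_transform(
        df[["flow_rate"]].values, y.values
    ).ravel()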

Upvotes: 2
