Reputation: 143
I've been trying to run the RF classifier on a data set of ~50,000 entries with 20 or so labels, which I thought should be fine, but I keep getting the following when trying to fit:
Exception MemoryError: MemoryError() in 'sklearn.tree._tree.Tree._resize' ignored
Segmentation fault (core dumped)
The data set has been passed through the TfidfVectorizer and then TruncatedSVD with n=100 for dimensionality reduction. RandomForestClassifier is running with n_jobs=1 and n_estimators=10 in an attempt to find the minimum point at which it will work. The system has 4GB of RAM, and RF has worked in the past on a similar data set with a much higher number of estimators. Scikit-learn is at the current version, 0.14.1.
Any tips?
Thanks
Upvotes: 10
Views: 1917
Reputation: 5
Try the 'psutil' library (link: https://pypi.python.org/pypi/psutil/0.5.0). With it, you can monitor the amount of available memory on your system using the following function:
psutil.phymem_usage()
This will help you determine whether your system is running out of memory or whether the problem is in your code.
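As a rough sketch of that check: the snippet below estimates how much memory the dense feature matrix from the question needs (50,000 rows by 100 SVD components, assuming float64) and compares it with what the system reports as available. The helper names dense_matrix_bytes and available_bytes are made up for illustration. It also uses psutil.virtual_memory() rather than phymem_usage(), on the assumption that a newer psutil release is installed, since newer releases replaced the older call.

```python
try:
    import psutil  # third-party; optional in this sketch
except ImportError:
    psutil = None


def dense_matrix_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array (float64 by default)."""
    return n_rows * n_cols * itemsize


def available_bytes():
    """Available physical memory in bytes, or None if psutil is not installed."""
    if psutil is None:
        return None
    return psutil.virtual_memory().available


# Sizes taken from the question: ~50,000 entries, 100 SVD components.
needed = dense_matrix_bytes(50000, 100)
free = available_bytes()

print("feature matrix: %.1f MB" % (needed / 1e6))
if free is not None:
    print("available:      %.1f MB" % (free / 1e6))
```

If the feature matrix is only a small fraction of the available memory, the input itself is probably not the problem, and the memory is more likely being consumed while the trees are grown.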
Upvotes: 0
Reputation: 10556
Segfaults are always bugs. If a malloc fails inside RandomForest then it should be caught, and it is my best guess that this is what is happening to you. As a commenter already said, you should report this to the RandomForest bug tracker. But the malloc is probably failing because of an out-of-memory condition, so reduce your dimensionality, reduce your training data set size, get more memory, or run on a system with more memory.
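A minimal sketch of the "reduce your training data set size" suggestion, using small random stand-in data rather than the asker's real features. Capping tree growth with max_depth and min_samples_leaf is an extra idea of mine, not something proposed above. It is included because the error is raised while a tree is being resized, and the specific values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Stand-in data (assumption): 2,000 rows, 100 features, 20 classes.
# float32 halves the footprint of the feature matrix compared with float64.
X = rng.rand(2000, 100).astype(np.float32)
y = rng.randint(0, 20, size=2000)

# Train on a random subset of the rows to shrink the training set.
subset = rng.choice(X.shape[0], size=1000, replace=False)

clf = RandomForestClassifier(
    n_estimators=10,
    max_depth=12,        # assumption: bound on how deep each tree may grow
    min_samples_leaf=5,  # assumption: avoid one leaf per training sample
    n_jobs=1,
    random_state=0,
)
clf.fit(X[subset], y[subset])

# Node counts give a rough idea of how much memory each tree uses.
largest = max(est.tree_.node_count for est in clf.estimators_)
print("largest tree has %d nodes" % largest)
```

If the fit succeeds with a small subset and tight limits, increasing the subset size and relaxing the limits step by step should show roughly where memory runs out.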
Upvotes: 2