Reputation: 1191
StanfordNLP's TreeLSTM, when used with a dataset of more than 30K instances, causes LuaJIT to fail with a "not enough memory" error. I am working around this with LuaJIT Data Structures (LDS). To get the dataset outside of Lua's heap, the trees need to be placed in an LDS.Vector.
Since the LDS.Vector holds cdata, the first step was to make the Tree type into a cdata object:
local ffi = require('ffi')

ffi.cdef([[
typedef struct CTree {
    struct CTree* parent;        /* NULL for the root node */
    int num_children;            /* number of populated entries in children */
    struct CTree* children[25];  /* fixed-capacity child pointers */
    int idx;                     /* node index within the sentence */
    int gold_label;              /* supervised label for this node */
    int leaf_idx;                /* word index for leaf nodes */
} CTree;
]])
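For context, here is roughly how the nodes then go into an LDS.Vector. This is a minimal sketch: the require path, Vector constructor, and push_back call are written from my reading of the lds README, so treat the exact calls as assumptions rather than a verified API.

local lds = require 'lds.Vector'  -- LuaJIT Data Structures (assumed module path)

-- The vector's storage is malloc'd, so the CTree structs live outside
-- the LuaJIT GC heap that was running out of memory.
local trees = lds.Vector(ffi.typeof('CTree'))

local node = ffi.new('CTree')     -- temporary; its bytes are copied below
node.num_children = 0
node.idx = 1
node.gold_label = -1
node.leaf_idx = -1
trees:push_back(node)             -- copies the struct into off-heap storage

One caveat with storing the structs by value: if the vector reallocates as it grows, any parent/children pointers aimed at elements inside it would dangle, so it may be safer to store node indices instead of raw pointers, or to size the vector up front.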
There are also small changes that need to be made in read_data.lua to handle the new cdata CTree type. So far, using LDS seemed like a reasonable way around the memory limit; however, the Tree type also carries a field named 'composer', which the CTree must represent.
The composer is of type nn.gModule. Continuing with this solution would mean creating a cdata typedef for nn.gModule, including typedefs for its members. Before going further: does this seem like the right direction to follow? Does anyone have experience with this problem?
Upvotes: 9
Views: 1467
Reputation: 21
As you've discovered, representing structured data in a LuaJIT heap-friendly manner is a bit of a pain at the moment.
In the Tree-LSTM implementation, the tree tables each hold a pointer to a composer instance mainly for expediency in implementation.
One workaround to avoid typedef-ing nn.gModule would be to use the existing idx field to index into a table of composer instances. In this approach, the pair (sentence_idx, node_idx) can be used to uniquely identify a composer in a global two-level table of composer instances. To avoid memory issues, the current cleanup code can be replaced with a line that sets the corresponding index in the table to nil.
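A sketch of what this could look like; the names composers, build_composer, and free_composer are illustrative, not taken from the Tree-LSTM source:

-- Global two-level table: composers[sentence_idx][node_idx] -> nn.gModule
local composers = {}

-- Fetch the composer for a node, building it on first use.
-- build_composer stands in for whatever constructs the nn.gModule.
local function get_composer(sentence_idx, node_idx, build_composer)
  local row = composers[sentence_idx]
  if not row then
    row = {}
    composers[sentence_idx] = row
  end
  if not row[node_idx] then
    row[node_idx] = build_composer()
  end
  return row[node_idx]
end

-- Cleanup: drop the reference so the Lua GC can reclaim the module.
-- This is the line that would replace the existing cleanup code.
local function free_composer(sentence_idx, node_idx)
  local row = composers[sentence_idx]
  if row then row[node_idx] = nil end
end

Since the CTree then only needs its integer idx field, nothing about nn.gModule has to cross the FFI boundary.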
Upvotes: 2