NightFury13

Reputation: 771

How to read large files (>1GB) in lua?

I am a novice at Lua (I use it for the Torch7 framework). I have an input feature file (a text file) which is about 1.4 GB in size. The simple io.open function throws a 'not enough memory' error when trying to open this file. While browsing through the user groups and documentation, I see that it's possibly a Lua limitation. Is there a workaround for this? Or am I doing something wrong in reading the file?

local function parse_file(path)
    -- read file
    local file = assert(io.open(path,"r"))
    local content = file:read("*all")
    file:close()

    -- split on start/end tags.
    local sections = string.split(content, start_tag)
    for j=1,#sections do
        sections[j] = string.split(sections[j],'\n')
        -- remove the end_tag
        table.remove(sections[j], #sections[j])
    end 
    return sections
end

local train_data = parse_file(file_loc .. '/' .. train_file)

EDIT : The input file I am trying to read contains image features I would like to train my model on. The file is laid out in an ordered fashion ({start-tag} ...contents...{end-tag}{start-tag} ... and so on...), so it is fine if I can load these sections (start-tag to end-tag) one at a time. However, I would still want all of these sections loaded in memory.
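For example, something along these lines would be fine (the tag strings below are just placeholders for my actual markers, and each tag is assumed to sit on its own line), as long as all the resulting sections can still be kept in memory:

-- Sketch: stream the file and group lines into sections instead of read("*all").
local start_tag = "{start-tag}"   -- placeholder marker
local end_tag   = "{end-tag}"     -- placeholder marker

local function parse_file_by_lines(path)
    local sections, current = {}, nil
    for line in io.lines(path) do
        if line == start_tag then
            current = {}                        -- open a new section
        elseif line == end_tag then
            sections[#sections + 1] = current   -- close and keep the section
            current = nil
        elseif current then
            current[#current + 1] = line        -- body line of the current section
        end
    end
    return sections
end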

Upvotes: 3

Views: 3013

Answers (2)

NightFury13

Reputation: 771

It turns out that the simplest way around the large-file loading problem is to rebuild Torch with Lua 5.2 or greater, as suggested by the developers of Torch on the torch7 Google group:

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh

These memory limits don't exist from Lua 5.2 onwards! I have tested this and it works just fine!
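To double-check which interpreter the rebuilt Torch is actually using, something like this at the th prompt should show it:

-- Prints e.g. "Lua 5.2" on a plain-Lua build, or the LuaJIT version string otherwise
print(_VERSION)
print(jit and jit.version or "not running LuaJIT")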

Reference : https://groups.google.com/forum/#!topic/torch7/fi8a0RTPvDo


Another possible solution (which is more elegant and similar to what @Adam suggested in his answer) is to read the file line by line and use Tensors or tds to store the data, as these allocate memory outside of LuaJIT's heap. A code sample is below, thanks to Vislab.

local ffi = require 'ffi'
require 'torch' -- for torch.CharTensor
-- this function loads a file line by line to avoid having memory issues
local function load_file_to_tensor(path)
  -- initialize a tensor for the file
  local file_tensor = torch.CharTensor()
  
  -- Now we must determine the maximum size of the tensor in order to allocate it in memory.
  -- This lets us allocate the tensor in one sweep, where columns correspond to characters and rows correspond to lines in the text file.
  
  --[[ get  number of rows/columns ]]
  local file = io.open(path, 'r') -- open file
  local max_line_size = 0
  local number_of_lines = 0
  for line in file:lines() do
    -- get maximum line size
    max_line_size = math.max(max_line_size, #line + 1) -- the +1 leaves room for the terminating zero byte that ffi.copy writes
    
    -- increment the number of lines counter
    number_of_lines = number_of_lines +1
  end
  file:close() --close file
  
  -- Now that we have the maximum size of the vector, we just have to allocate memory for it (as long as there is enough RAM)
  file_tensor = file_tensor:resize(number_of_lines, max_line_size):fill(0)
  local f_data = file_tensor:data()
  
  -- The only thing left to do is to fetch data into the tensor. 
  -- Let's open the file again and fill the tensor using ffi
  local file = io.open(path, 'r') -- open file
  for line in file:lines() do
    -- copy data into the tensor line by line
    ffi.copy(f_data, line)
    f_data = f_data + max_line_size
  end
  file:close() --close file

  return file_tensor
end

Reading data back from this tensor is simple and quick. For example, if you want to read the 10th line of the file (which will be in the 10th row of the tensor), you can simply do the following:

local line_string = ffi.string(file_tensor[10]:data()) -- this will convert into a string var

A word of warning: this will occupy more space in memory, and may not be optimal for cases where a few lines are much longer than the others. But if you don't have memory issues, this can even be disregarded, because loading tensors from files into memory is blazingly fast and might save you some grey hairs in the process.
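If the padded CharTensor wastes too much space for your data, the tds package mentioned above is a rough alternative; the sketch below (untested, just to illustrate the idea) stores each line in a tds.Vec, whose contents also live outside the LuaJIT heap:

local tds = require 'tds'

-- Sketch: keep each line in a tds.Vec instead of a padded CharTensor
local function load_file_to_vec(path)
  local lines = tds.Vec()
  for line in io.lines(path) do
    lines:insert(line)   -- appends the line; storage is allocated in C memory
  end
  return lines
end

-- Usage: local lines = load_file_to_vec(path); print(lines[10])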

Reference : https://groups.google.com/forum/#!topic/torch7/fi8a0RTPvDo

Upvotes: 4

Adam

Reputation: 81

I've never had the need to read a file so large, but if you're running out of memory you will probably need to read it line by line. After some quick research I found this on the Lua website:

buff = buff..line.."\n"

buff is a new string with 50,020 bytes, and the old string is now garbage. After two loop cycles, buff is a string with 50,040 bytes, and there are two old strings making a total of more than 100 Kbytes of garbage. Therefore, Lua decides, quite correctly, that it is a good time to run its garbage collector, and so it frees those 100 Kbytes. The problem is that this will happen every two cycles, and so Lua will run its garbage collector two thousand times before finishing the loop. Even with all this work, its memory usage will be around three times the file size. To make things worse, each concatenation must copy the whole string content (50 Kbytes and growing) into the new string.

So it seems that loading large files uses insane amounts of memory even if you read them line by line and concatenate each time like this:

local buff = ""  
while 1 do  
    local line = read()  
    if line == nil then break end  
    buff = buff..line.."\n"  
end  

They then propose a more memory-conserving approach (shown here with #stack in place of the original stack.n field, since table.insert no longer maintains n in Lua 5.1+):

  function newBuffer ()
    return {}                    -- an empty stack of string pieces
  end

  function addString (stack, s)
    table.insert(stack, s)       -- push 's' onto the top of the stack
    -- merge the top piece downwards while the piece below is not longer
    for i = #stack - 1, 1, -1 do
      if string.len(stack[i]) > string.len(stack[i+1]) then break end
      stack[i] = stack[i]..table.remove(stack)
    end
  end

  function toString (stack)
    -- collapse all remaining pieces into one string
    for i = #stack - 1, 1, -1 do
      stack[i] = stack[i]..table.remove(stack)
    end
    return stack[1]
  end

This takes far less memory than the naive concatenation loop above. All the information is from: http://www.lua.org/notes/ltn009.html
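As a quick usage sketch (my own addition, not from the note; the file name is just a placeholder):

local buff = newBuffer()
for line in io.lines("big_file.txt") do   -- placeholder path
    addString(buff, line .. "\n")          -- buffer grows without quadratic copying
end
local content = toString(buff)             -- single final string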
Hope that helped.

Upvotes: 0
