Scanning big binary with Erlang

I would like to scan large (>500 MB) binary files for structs/patterns. I am new to the language and hope that someone can give me a start. The files are actually a database containing segments. A segment starts with a fixed-size header, followed by a fixed-size optional part, followed by the payload/data part of variable length. For a first test I would just like to log the number of segments in the file. I have already googled for a tutorial but found nothing that helped. I need a hint or a tutorial that is not too far from my use case to get started.

Greets Stefan

Upvotes: 6

Answers (3)

Hynek -Pichi- Vychodil

Reputation: 26121

When your data fits into memory, the best thing you can do is read the whole file at once using file:read_file/1. If it does not fit, open the file in raw mode and read it piece by piece. You can then parse the data using the bit syntax. Written the right way, the parser can reach tens of MB/s when the parsing module is compiled with HiPE. The exact parsing technique depends on the exact segment format and on how robust and accurate the result needs to be. For parallel parsing, you can take inspiration from Tim Bray's Wide Finder project.
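As a minimal sketch of the read-everything-then-match approach, assuming the file fits in memory: the segment layout below is hypothetical (the question does not give one), so the "SEG1" magic, the 32-bit big-endian payload length and the 16-byte optional part are placeholders, as are the module and function names.

```erlang
-module(segcount).
-export([count_file/1, count/1]).

%% Hypothetical layout, NOT taken from the question:
%%   header:   4-byte magic "SEG1" + 32-bit big-endian payload length
%%   optional: 16 bytes, always present
%%   payload:  Len bytes
-define(OPT_SIZE, 16).

%% Read the whole file into one binary and count its segments.
count_file(Path) ->
    {ok, Bin} = file:read_file(Path),
    count(Bin).

count(Bin) ->
    count(Bin, 0).

%% Len is bound by the header and reused as the payload size
%% later in the same binary pattern.
count(<<"SEG1", Len:32/big, _Opt:?OPT_SIZE/binary,
        _Payload:Len/binary, Rest/binary>>, N) ->
    count(Rest, N + 1);
count(<<>>, N) ->
    {ok, N};
count(_Other, N) ->
    %% Trailing bytes that do not form a complete segment.
    {error, {bad_segment, N}}.
```

Calling segcount:count_file/1 with a path returns {ok, Count} when the file is a clean sequence of such segments. The real header fields would replace the placeholder pattern in the first clause of count/2.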

Upvotes: 1

Ning

Reputation: 2880

Here is a synthesized sample problem: I have a binary file (test.txt) that I want to parse. I want to find all the binary patterns of <<$a, $b, $c>> in the file.

The content of "test.txt":

I arbitrarily decide to choose the string "abc" as my target string for my test. I want to find all the abc's in my testing file.

A sample program (lab.erl):

-module(lab).
-export([find/2]).

%% Scan InputFile for every occurrence of BinPattern and print each offset.
find(BinPattern, InputFile) ->
    BinPatternLength = byte_size(BinPattern),
    {ok, S} = file:open(InputFile, [read, binary, raw]),
    loop(S, BinPattern, 0, BinPatternLength, 0),
    file:close(S),
    io:format("Done!~n", []).

%% Read Length bytes at StartPos, compare them with the pattern,
%% then move the window forward by one byte.
loop(S, BinPattern, StartPos, Length, Acc) ->
    case file:pread(S, StartPos, Length) of
        {ok, Bin} ->
            case Bin of
                BinPattern ->
                    io:format("Found one at position: ~p.~n", [StartPos]),
                    loop(S, BinPattern, StartPos + 1, Length, Acc + 1);
                _ ->
                    loop(S, BinPattern, StartPos + 1, Length, Acc)
            end;
        eof ->
            io:format("I've proudly found ~p matches:)~n", [Acc])
    end.

Run it:

1> c(lab).
{ok,lab}
2> lab:find(<<"abc">>, "./test.txt").     
Found one at position: 43.
Found one at position: 103.
I've proudly found 2 matches:)
Done!
ok

Note that the code above is not very efficient (the scanning process shifts one byte at a time) and it is sequential (not utilizing all the "cores" on your computer). It is meant only to get you started.
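For comparison, here is a sketch of a different technique that avoids one pread call per byte: read the file once and let binary:matches/2 from the standard library do the search. It assumes the file fits in memory, and the module name lab_matches is only illustrative. Note that binary:matches/2 reports non-overlapping matches, which makes no difference for a pattern like "abc".

```erlang
-module(lab_matches).
-export([find/2]).

%% Return the 0-based byte offsets of every (non-overlapping)
%% occurrence of Pattern in InputFile.
find(Pattern, InputFile) ->
    {ok, Bin} = file:read_file(InputFile),
    [Pos || {Pos, _Len} <- binary:matches(Bin, Pattern)].
```

lab_matches:find(<<"abc">>, "./test.txt") returns the list of offsets instead of printing them, so the number of matches is simply the length of that list.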

Upvotes: 1

Muzaaya Joshua

Reputation: 7836

You need to learn about the bit syntax and binary comprehensions. Some useful links to follow: http://www.erlang.org/documentation/doc-5.6/doc/programming_examples/bit_syntax.html, and http://goto0.cubelogic.org/a/90.

You will also need to learn how to process files: reading from files (line by line, chunk by chunk, at given positions in a file, etc.) and writing to files in several ways. The file processing functions are explained here.
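Reading at given positions is a natural fit for counting segments in a file too large to load: read only each header, then jump over the rest of the segment. The sketch below assumes a hypothetical layout that is not from the question (an 8-byte header made of the magic "SEG1" plus a 32-bit big-endian payload length, followed by a 16-byte optional part); the module name segwalk is also made up.

```erlang
-module(segwalk).
-export([count/1]).

%% Hypothetical sizes, NOT taken from the question.
-define(HDR_SIZE, 8).    % 4-byte magic + 32-bit payload length
-define(OPT_SIZE, 16).   % fixed-size optional part

%% Count segments by reading only the headers.
count(Path) ->
    {ok, Fd} = file:open(Path, [read, binary, raw]),
    try walk(Fd, 0, 0)
    after file:close(Fd)
    end.

walk(Fd, Pos, N) ->
    case file:pread(Fd, Pos, ?HDR_SIZE) of
        {ok, <<"SEG1", Len:32/big>>} ->
            %% Skip the optional part and the payload without reading them.
            walk(Fd, Pos + ?HDR_SIZE + ?OPT_SIZE + Len, N + 1);
        eof ->
            {ok, N};
        {ok, _ShortOrUnknown} ->
            {error, {bad_header, Pos}};
        {error, Reason} ->
            {error, Reason}
    end.
```

Because the payload is never read, a final segment whose payload is truncated is still counted; a stricter version would compare the computed end position with the file size.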

You can also look at the source code of the large-file processing libraries that ship with Erlang, e.g. Disk Log, Dets and Mnesia. These libraries read from and write to files heavily, and their source code is open for you to see.

I hope that helps

Upvotes: 4
