Reputation: 61
I like to scan larger(>500M) binary files for structs/patterns. I am new to the language an hope that someone can give me start. Actually the files are a database containing Segments. A Segment starts with a fixed sized header followed by a fixed sized optional part followed by the payload/data part of variable lenght. For a first test i just like to log the number of segments in the file. I googled already for some tutorial but found nothing which helped. I need a hint or a tutorial which is not too far from my use case to get started.
Greets Stefan
Upvotes: 6
Views: 1935
Reputation: 26121
When your data fits into memory, best thing what you can to do is read data in whole using file:read_file/1
. If you can't use file in raw
mode. Then you can parse data using bit_syntax. If you write it in right manner you can achieve parsing speed in tens of MB/s when parsing module is compile using HiPE. Exact techniques of parsing depends on exact segment data format and how robust/accurate result you are looking for. For parallel parsing you can inspire by Tim Bray's Wide Finder project.
Upvotes: 1
Reputation: 2880
Here is a synthesized sample problem: I have a binary file (test.txt) that I want to parse. I want to find all the binary patterns of <<$a, $b, $c>>
in the file.
The content of "test.txt":
I arbitrarily decide to choose the string "abc" as my target string for my test. I want to find all the abc's in my testing file.
A sample program (lab.erl):
-module(lab).
-compile(export_all).
find(BinPattern, InputFile) ->
BinPatternLength = length(binary_to_list(BinPattern)),
{ok, S} = file:open(InputFile, [read, binary, raw]),
loop(S, BinPattern, 0, BinPatternLength, 0),
file:close(S),
io:format("Done!~n", []).
loop(S, BinPattern, StartPos, Length, Acc) ->
case file:pread(S, StartPos, Length) of
{ok, Bin} ->
case Bin of
BinPattern ->
io:format("Found one at position: ~p.~n", [StartPos]),
loop(S, BinPattern, StartPos + 1, Length, Acc + 1);
_ ->
loop(S, BinPattern, StartPos + 1, Length, Acc)
end;
eof ->
io:format("I've proudly found ~p matches:)~n", [Acc])
end.
Run it:
1> c(lab).
{ok,lab}
2> lab:find(<<"abc">>, "./test.txt").
Found one at position: 43.
Found one at position: 103.
I've proudly found 2 matches:)
Done!
ok
Note that the code above is not very efficient (the scanning process shifts one byte at a time) and it is sequential (not utilizing all the "cores" on your computer). It is meant only to get you started.
Upvotes: 1
Reputation: 7836
you need to learn about Bit Syntax and Binary Comprehensions. More useful links to follow: http://www.erlang.org/documentation/doc-5.6/doc/programming_examples/bit_syntax.html, and http://goto0.cubelogic.org/a/90.
You will also need to learn how to process files, reading from files (line-by-line, chunk-by-chunk, at given positions in a file, e.t.c.), writing to files in several ways. The file processing functions are explained here
You can also choose to look at the source code of large file processing libraries within the erlang packages e.g. Disk Log, Dets and mnesia. These libraries heavily read and write into files and their source code is open for you to see.
I hope that helps
Upvotes: 4