DomWolfe

Reputation: 137

Split binary by number of characters in Erlang using RegEx

I am trying to split a binary into 80-character chunks.

Li= <<"Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Maecenas vitae ligula urna.     Etiam id pulvinar arcu. Ut
    maximus eros sed ligula blandit aliquet. Vivamus arcu urna,
    efficitur cursus dapibus nec, cursus sit amet elit. Aliquam
    tortor magna, aliquet vulputate nulla sit amet, efficitur cras amet.">>.

I have tried re:split(Li, "(.{80})"), which gives me:

[<<>>,
 <<"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas vitae ligula u">>,
 <<>>,
 <<"rna. Etiam id pulvinar arcu. Ut maximus eros sed ligula blandit aliquet. Vivamus">>,
 <<>>,
 <<" arcu urna, efficitur cursus dapibus nec, cursus sit amet elit. Aliquam tortor m">>,
 <<"agna, aliquet vulputate nulla sit amet, efficitur cras amet.">>]

How do I get rid of the empty parts of the list and why am I getting them?

Upvotes: 1

Views: 547

Answers (3)

rvirding

Reputation: 20916

You could do

re:run(B, <<".{80}">>,[{capture,first,binary},global]).

but it returns {match, Matches}, where Matches is a list of lists of binaries (one single-element list per match).
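
If a flat list is more convenient, the nested result can be unwrapped with a list comprehension. The following is only a sketch: the module name chunk_run and the function chunks/2 are made up for illustration, and it deliberately differs from the call above in two ways. It uses .{1,N} so that a final chunk shorter than N is kept, and it passes dotall so that newlines count as characters.

```erlang
-module(chunk_run).
-export([chunks/2]).

%% chunk_run and chunks/2 are hypothetical names, not part of OTP.
%% Returns the matches of ".{1,N}" in Bin as a flat list of binaries.
chunks(Bin, N) when is_binary(Bin), is_integer(N), N > 0 ->
    %% Build the pattern as iodata, e.g. ".{1,80}" for N = 80.
    Pattern = [".{1,", integer_to_list(N), "}"],
    case re:run(Bin, Pattern, [global, dotall, {capture, first, binary}]) of
        %% Each match arrives as a one-element list; unwrap it.
        {match, Matches} -> [M || [M] <- Matches];
        %% An empty subject produces no match at all.
        nomatch -> []
    end.
```

For example, chunk_run:chunks(<<"abcdefg">>, 2) should give [<<"ab">>,<<"cd">>,<<"ef">>,<<"g">>].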

Upvotes: 0

Soup in Boots

Reputation: 2392

You're getting empty parts because those are the portions between your matches. re:split (like string:tokens) returns the data around the matched portions, not the matched portions themselves. The only reason you receive the eighty-character chunks at all is that your regular expression contains a capturing group.

To the best of my knowledge, there is no way to remove the empty parts of your result (without explicit filtering), because those are the parts that re:split expects to return.
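
Explicit filtering could look like the sketch below, which simply drops every empty binary from the re:split result. The module name split_filter and the function split_chunks/2 are hypothetical names chosen for this example.

```erlang
-module(split_filter).
-export([split_chunks/2]).

%% split_filter and split_chunks/2 are hypothetical names, not part of OTP.
%% Splits Bin on a captured run of N characters and discards the empty
%% binaries that re:split returns for the text between adjacent matches.
split_chunks(Bin, N) when is_binary(Bin), is_integer(N), N > 0 ->
    %% Build the pattern as iodata, e.g. "(.{80})" for N = 80.
    Pattern = ["(.{", integer_to_list(N), "})"],
    [Part || Part <- re:split(Bin, Pattern), Part =/= <<>>].
```

With this, split_filter:split_chunks(<<"abcdefg">>, 2) should return [<<"ab">>,<<"cd">>,<<"ef">>,<<"g">>].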

One way you could achieve the desired result would be to use a standard regular expression (as opposed to splitting):

{match,[[<<"ab">>],[<<"cd">>],[<<"ef">>]]} = re:run("abcdefg", ".{2}", [global, {capture, all, binary}]).

As you can see, we're simply matching every two-character group we can find in the string; the leftover "g" is not part of any match.

That being said, regular expressions are not the ideal solution for this; they're overkill, to say the least. It should be relatively simple to write a function which extracts eighty-character chunks (or however many remain) from the binary. For instance:

%% Take 80 bytes at a time; the last chunk holds whatever remains.
make_chunks(<<C:80/binary, Rest/binary>>) ->
    [C | make_chunks(Rest)];
make_chunks(<<>>) ->
    [];
make_chunks(<<Rest/binary>>) ->
    [Rest].

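That would also work and doesn't require compiling or evaluating a regular expression. Keep in mind that 80/binary counts bytes, not characters. If you intend to handle Unicode, the "utf8" type can help, but it matches a single code point at a time and does not accept a size (<<C/utf8, Rest/binary>> is valid, <<C:80/utf8>> is not), so you would have to count the code points yourself.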
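
A minimal sketch of that code-point counting, assuming the input is valid UTF-8, might look as follows. The module name utf8_chunks and its chunks/2 function are invented for this illustration; on invalid UTF-8 the function fails with a function_clause error rather than guessing.

```erlang
-module(utf8_chunks).
-export([chunks/2]).

%% utf8_chunks and chunks/2 are hypothetical names, not part of OTP.
%% Splits a UTF-8 binary into chunks of at most N code points each.
chunks(Bin, N) when is_binary(Bin), is_integer(N), N > 0 ->
    chunks(Bin, N, 0, <<>>, []).

%% Input exhausted with nothing pending: return the chunks in order.
chunks(<<>>, _N, 0, _Acc, Out) ->
    lists:reverse(Out);
%% Input exhausted with a pending (possibly short) chunk: emit it.
chunks(<<>>, _N, _Count, Acc, Out) ->
    lists:reverse([Acc | Out]);
%% The current chunk already holds N code points: emit it, start anew.
chunks(Bin, N, N, Acc, Out) ->
    chunks(Bin, N, 0, <<>>, [Acc | Out]);
%% Move one code point from the input to the current chunk.
chunks(<<C/utf8, Rest/binary>>, N, Count, Acc, Out) ->
    chunks(Rest, N, Count + 1, <<Acc/binary, C/utf8>>, Out).
```

Note that this counts code points, not grapheme clusters, so a character built from several code points (a base letter plus a combining accent, say) can still be split across two chunks.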

Upvotes: 2

Toto

Reputation: 91385

I don't know Erlang, but in many languages, when you split on a regex with a capture group, as you do, the captured group is included in the result.

So, in effect, you are splitting on 80 characters and keeping the delimiter.

The result is:

  • First element: '': this is what comes before the first delimiter (i.e. before the first 80 characters)
  • Second element: Lorem ipsum ... ligula u: this is the first delimiter (i.e. the first 80 characters)
  • Third element: '': this is what comes between the first and second delimiters
  • And so on ...

Upvotes: 1
