blue piranha

Reputation: 3874

Extracting data from a large file with regex

I have a file of close to 800 MB that consists of several records, each a header followed by content. A header looks something like M=013;X=rast;645.jpg, and the content is the binary data of the JPG file.

So the file looks something like this:

M=013;X=rast;645.jpgNULœDüŠˆ.....M=217;X=rast;113.jpgNULÿñÿÿ&åbÿås....M=217;X=rast;1108.jpgNUL]_ÿ×ÉcË/...

A header can sit on one line or span two lines.

I need to parse this file and extract the individual JPG images.

Since the file is this large, can you suggest an efficient way to do it? I was hoping to use StreamReader, but I do not have much experience using regular expressions with it.

Upvotes: 0

Views: 172

Answers (1)

CSᵠ

Reputation: 10169

RegEx, with recursion via the (?1) subpattern call (not supported in .NET):
/(M=.+?;X=.+?;.+?\.jpg)(.+?(?=(?1)|$))/gs

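.NET RegEx workaround:
/(M=.+?;X=.+?;.+?\.jpg)(.+?(?=M=.+?;X=.+?;.+?\.jpg|$))/gs
This replaces the (?1) subpattern call with the contents of the 1st capture group. The slashes and the gs flags are regex101 notation: in .NET, pass the pattern without the / delimiters, use RegexOptions.Singleline in place of the s flag, and use Regex.Matches in place of the g flag.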
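
As a rough, language-neutral illustration, the expanded pattern can be tried on a small in-memory sample. The sketch below uses Python, whose built-in re module also has no (?1) subpattern call, so the same workaround applies there. The pattern is compiled for bytes because the content is binary; the names RECORD, split_records and SAMPLE are made up for this example, and the sample data is invented to resemble the file shown in the question.

```python
import re

# The expanded pattern from the answer, compiled for bytes.
# re.DOTALL plays the role of the s flag; finditer() plays the role of g.
RECORD = re.compile(
    rb"(M=.+?;X=.+?;.+?\.jpg)"            # group 1: the header
    rb"(.+?(?=M=.+?;X=.+?;.+?\.jpg|$))",  # group 2: content up to the next header or the end
    re.DOTALL,
)


def split_records(data: bytes) -> list[tuple[bytes, bytes]]:
    """Return (header, content) pairs found in data."""
    return [(m.group(1), m.group(2)) for m in RECORD.finditer(data)]


# Invented sample: three headers, each followed by a NUL and some fake image bytes.
SAMPLE = (
    b"M=013;X=rast;645.jpg\x00\xff\xd8AAAA"
    b"M=217;X=rast;113.jpg\x00\xff\xd8BB\nBB"
    b"M=217;X=rast;1108.jpg\x00\xff\xd8CC"
)

for header, content in split_records(SAMPLE):
    print(header.decode("ascii"), len(content), "bytes of content")
```

Note that group 2 starts with the NUL byte that follows the header in the sample, so that byte would need to be dropped before saving the image. Also, with the s flag each .+? can run across image bytes, so a stray M= inside the binary data may be taken for the start of a header; treat the pattern as a starting point.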

Live demo and Explanation of RegExp: http://regex101.com/r/nQ3pE0/1

You'll want to use the 2nd capture group for the binary contents. The 1st group matches the header, and the expression also needs the header pattern in its lookahead to know where each record stops.
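
For a file of roughly 800 MB, one possible approach is to memory-map it and write each image out as it is found. The Python sketch below is an alternative to capturing the content in group 2, not the method described above: it matches only the headers and slices the bytes between one header and the next. It rests on several assumptions that are not confirmed by the question: a NUL byte ends every header (as the sample suggests), no header field is longer than 64 bytes, and a line break in a header falls either inside a field or right after a semicolon. HEADER, extract_images and the output naming scheme are hypothetical.

```python
import mmap
import os
import re

# Hypothetical, tightened header pattern. Fields may not contain ';' or NUL and
# are capped at 64 bytes, which lowers (but does not remove) the chance that a
# stray "M=" in the image data is taken for a header. The final \x00 is the
# assumed separator between header and content.
HEADER = re.compile(
    rb"M=[^;\x00]{1,64};\s*X=[^;\x00]{1,64};\s*([^;\x00]{1,64}\.jpg)\x00"
)


def extract_images(container_path: str, out_dir: str) -> list[str]:
    """Write each embedded image into out_dir and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(container_path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Keep only offsets and names, not the match objects themselves.
        headers = [(m.start(), m.end(), m.group(1)) for m in HEADER.finditer(mm)]
        for i, (_, content_start, raw_name) in enumerate(headers):
            # Content runs from the end of this header to the next header or EOF.
            content_end = headers[i + 1][0] if i + 1 < len(headers) else len(mm)
            # Remove line breaks and any path parts from the embedded file name.
            cleaned = raw_name.replace(b"\r", b"").replace(b"\n", b"").strip()
            name = os.path.basename(cleaned.decode("latin-1"))
            # Prefix with the record index so repeated names do not collide.
            path = os.path.join(out_dir, f"{i:05d}_{name}")
            with open(path, "wb") as out:
                out.write(mm[content_start:content_end])
            written.append(path)
    return written
```

A call such as extract_images("container.bin", "out") (both names are placeholders) would then leave one file per header in the out directory. Only one image at a time is copied into memory, and the operating system pages the rest of the container in on demand.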


Upvotes: 1
