tyler

Reputation: 1

batch base64 image decode

I've got a large (117 MB!) HTML file that contains thousands of images encoded as base64. I'd like to decode them to JPGs, but my bash-fu isn't up to the task and I haven't been able to find an answer online.

Upvotes: 0

Views: 1741

Answers (3)

Harun

Reputation: 1662

In general, HTML can't be parsed properly with regular expressions, but if you have a specific limited format then it could work.

Given a simple format like

<body>
<img src="data:image/jpeg;base64,DpFDPGOIg3renreGR43LGLJKds==">
<img src="data:image/jpeg;base64,DpFDPGOIg3renreGR43LGLJKds=="><img src="data:image/jpeg;base64,DpFaPGOIg3renreGR43LGLJKds==">
<div><img src="data:image/jpeg;base64,DpFdPGOIg3renreGR43LGLJKds=="></div>
</body>

the following can pull out the data

i=0; awk 'BEGIN{RS="<"} /="data:image\/jpeg;base64,[^\"]*"/ { match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' test.html | while read d; do echo $d | base64 -d > img$i.jpg; i=$(($i+1)); done

To break that down:

i=0 Keep a counter so we can output different filenames for each image.

awk 'BEGIN{RS="<"} Run awk with the Record Separator changed from the default newline to <, so we always treat each HTML element as a separate record.

/="data:image\/jpeg;base64,[^\"]*"/ Only run the following commands on records that have embedded base64 jpeg data.

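{ match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' Pull out the data itself, the part matched with parentheses between the comma and the trailing quotation mark, then print it. Note that the three-argument form of match(), which fills the array data, is a GNU awk (gawk) extension.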

test.html Just the input filename.

| while read d; do Pipe the output base64 data to a loop. read will put each line into d until there's no more input.

echo $d | base64 -d > img$i.jpg; Pass the current image through the base64 decoder and store the output to a file.

i=$(($i+1)); Increment to change the next filename.

done Done.
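
For awk implementations without the three-argument match(), the extraction step might be sketched as below. This is only a sketch under assumptions: demo.html is a hypothetical stand-in file, and its payloads are short placeholders rather than real images. It uses the two-argument match(), which sets RSTART and RLENGTH, as the pattern itself, so the regexp is written only once.

```shell
# Sketch only: demo.html is a hypothetical file with placeholder payloads.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
printf '<div><img src="data:image/jpeg;base64,QUJD"></div>\n<img src="data:image/jpeg;base64,REVG">\n' > demo.html

# RS="<" makes every tag its own record, however many tags share a line.
# match() used as the pattern is true only for records holding JPEG data,
# and it leaves the position and length of the match in RSTART and RLENGTH.
payloads=$(awk 'BEGIN{RS="<"}
    match($0, /="data:image\/jpeg;base64,[^"]*"/) {
        prefix = length("=\"data:image/jpeg;base64,")
        # Skip the prefix and leave off the closing quotation mark.
        print substr($0, RSTART + prefix, RLENGTH - prefix - 1)
    }' demo.html)

printf '%s\n' "$payloads"
```

Each printed line is one base64 payload, ready to be fed to the same while loop as in the one-liner.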

There are a few things that could probably be done better here:

  • There should be a way to get the line-match regexp to capture the base64 data directly, instead of repeating the regexp in a call to the match() function, but I couldn't get it to work.
  • I don't like the technique of reading a pipe into the variable d, only to echo it back out to another pipe - it would be nicer to just pipe straight through - but base64 doesn't know to only use one line of the input.
  • For some reason I have not yet figured out, incrementing the counter directly where it's used (i.e. echo $d | base64 -d > img$((i++)).jpg) only wrote to the first file, even though echo $d > img$((i++)).b64 correctly wrote the encoded data to multiple files. Rather than waiting on working that out, I've just split the increment into its own command.
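
One possible way to work around the points above is sketched here. It is not a tested replacement for the one-liner: sample.html, payloads.txt and the img prefix are hypothetical names, the payloads are short text rather than real JPEGs, and it assumes a grep that supports -o and a base64 that accepts -d. The counter is still incremented in its own command, as in the answer.

```shell
# Sketch only: sample.html, payloads.txt and the img prefix are hypothetical.
# Assumes grep supports -o and base64 accepts -d (as in GNU coreutils).
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in input: the payloads decode to short text, not to real JPEGs.
printf '<body>\n<img src="data:image/jpeg;base64,%s"><img src="data:image/jpeg;base64,%s">\n</body>\n' \
    "$(printf 'one' | base64)" "$(printf 'two' | base64)" > sample.html

# grep -o prints every match on its own line; sed strips the fixed prefix,
# so the regexp for the data is written only once.
grep -o 'data:image/jpeg;base64,[^"]*' sample.html |
    sed 's/^data:image\/jpeg;base64,//' > payloads.txt

i=0
# Reading from a file keeps the loop in the current shell,
# so the value of i is still available after "done".
while read -r d; do
    # The here-document hands exactly one payload to base64, with no echo pipe.
    base64 -d > "img$i.jpg" <<EOF
$d
EOF
    i=$((i + 1))
done < payloads.txt

echo "decoded $i images"
```

Writing the payloads to an intermediate file costs some disk space on a 117 MB input, but it makes the loop easier to rerun and inspect.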

Upvotes: 1

Kyle Banerjee

Reputation: 2794

  1. Use a regex to write each base64 image to a separate file.
  2. Write a loop to iterate through those files.
  3. The bash command to decode each file will be along the lines of: base64 -d base64_file1 > file1.jpg
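
The three steps could be sketched roughly as follows. The file names page.html and the b64_ prefix are hypothetical, the payloads are short text rather than real JPEGs, and split is a swapped-in way of getting one payload per file, not something this answer specifies.

```shell
# Sketch only: page.html and the b64_ prefix are hypothetical names.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in input: the payloads decode to short text, not to real JPEGs.
printf '<img src="data:image/jpeg;base64,%s">\n<img src="data:image/jpeg;base64,%s">\n' \
    "$(printf 'first' | base64)" "$(printf 'second' | base64)" > page.html

# Step 1: pull out the payloads and write one per file.
# -a 4 gives four-letter suffixes (b64_aaaa, b64_aaab, ...), enough for
# thousands of images; the default of two letters allows only 676 files.
grep -o 'data:image/jpeg;base64,[^"]*' page.html |
    sed 's/^[^,]*,//' |
    split -l 1 -a 4 - b64_

# Steps 2 and 3: loop over the payload files and decode each one.
for f in b64_*; do
    base64 -d < "$f" > "$f.jpg"
done

ls
```

The output names follow the split suffixes rather than a numeric counter, so they sort in the same order as the images appear in the HTML.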

Upvotes: 0

godsuya

Reputation: 1

You can try scraping the encoded image strings out of the HTML using Python. Then check this out for converting the encoded strings to images.

Upvotes: 0
