Reputation: 1
I've got a large (117MB!) HTML file that has thousands of images encoded as base64. I'd like to decode them to JPGs, but my bash-fu isn't enough to do this and I haven't been able to find an answer online.
Upvotes: 0
Views: 1741
Reputation: 1662
In general, HTML can't be parsed properly with regular expressions, but if you have a specific limited format then it could work.
Given a simple format like
<body>
<img src="data:image/jpeg;base64,DpFDPGOIg3renreGR43LGLJKds==">
<img src="data:image/jpeg;base64,DpFDPGOIg3renreGR43LGLJKds=="><img src="data:image/jpeg;base64,DpFaPGOIg3renreGR43LGLJKds==">
<div><img src="data:image/jpeg;base64,DpFdPGOIg3renreGR43LGLJKds=="></div>
</body>
the following can pull out the data
i=0; awk 'BEGIN{RS="<"} /="data:image\/jpeg;base64,[^\"]*"/ { match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' test.html | while read d; do echo $d | base64 -d > img$i.jpg; i=$(($i+1)); done
To break that down:
i=0
Keep a counter so we can output different filenames for each image.
awk 'BEGIN{RS="<"}
Run awk with the Record Separator changed from the default newline to <, so we always treat each HTML element as a separate record.
/="data:image\/jpeg;base64,[^\"]*"/
Only run the following commands on records that have embedded base64 jpeg data.
{ match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }'
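Pull out the data itself, the part matched with parentheses between the comma and the trailing quotation mark, then print it. The three-argument form of match() that fills the data array is a GNU awk (gawk) extension, so this needs gawk rather than a strictly POSIX awk.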
test.html
Just the input filename.
| while read d; do
Pipe the output base64 data to a loop. read will put each line into d until there's no more input.
echo $d | base64 -d > img$i.jpg;
Pass the current image through the base64 decoder and store the output to a file.
i=$(($i+1));
Increment to change the next filename.
done
Done.
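As a rough sketch only, the pipeline broken down above could also be written as a small shell function that sticks to POSIX awk features. The name extract_images and its optional filename prefix argument are made up for illustration and are not part of the original command. It assumes every data URI is double-quoted and sits inside a single tag, and it uses the two-argument match(), which sets RSTART and RLENGTH, so the regex only has to be written once.

```shell
#!/usr/bin/env bash
# Sketch: extract base64-encoded JPEGs from a simple HTML file.
# extract_images and the prefix argument are hypothetical names.
# The function body is a ( ... ) subshell so its variables stay local.
extract_images() (
    html=$1
    prefix=${2:-img}
    i=0
    awk 'BEGIN { RS = "<" }
         # match() sets RSTART and RLENGTH for the matched text.
         match($0, /data:image\/jpeg;base64,[^"]*/) {
             # Skip the 23-character "data:image/jpeg;base64," lead-in
             # and print only the base64 payload.
             print substr($0, RSTART + 23, RLENGTH - 23)
         }' "$html" |
    while IFS= read -r d; do
        printf '%s\n' "$d" | base64 -d > "$prefix$i.jpg"
        # Incremented as its own command, as in the answer: in bash, an
        # increment placed in the redirection of a piped command runs in
        # a subshell, so the counter in the loop would never advance.
        i=$((i + 1))
    done
)
```

Under those assumptions, running extract_images test.html should write img0.jpg, img1.jpg, and so on into the current directory, one file per embedded image.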
There are a few things that could probably be done better in that command:
- [...] match() function, but I couldn't get it to work.
- [...] base64 doesn't know to only use one line of the input.
- Incrementing inside the redirection (echo $d | base64 -d > img$((i++)).jpg) only wrote to the first file, even though echo $d > img$((i++)).b64 correctly wrote the encoded data to multiple files. Rather than waiting on working that out, I've just split the increment into its own command.
Upvotes: 1