Coding of images in Blink archive

Question

I have a Blink archive (in mht format) saved from the Chrome browser. I'm trying to convert the section

Content-Type: image/jpeg
Content-Transfer-Encoding: binary
Content-Location: https://some_url

ÿØÿà^@^PJFIF^@^A^A^A^@`^@`^@^@ÿÛ^@C^@^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^AÿÛ^
^KÿÄ^@µ^P^@^B^A^C^C^B^D^C^E^E^D^D^@^@^A}^A^B^C^@

to an image file as follows:

string s = "\nÿØÿà^@^PJ...";
byte[] result = System.Convert.FromBase64String(s);
File.WriteAllBytes("image.jpg", result);

I get the error message: "The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters."

How can I fix it? There are probably newline characters in the string, but replacing them with an empty string does not help.

Alireza Roshanzamir · Accepted Answer

Because you mentioned that you want to implement your solution in Java, I wrote a simple C# solution that can be easily converted to Java.

The following code reads the robot.mhtml file and dumps the content of each part to separate files in the out/ directory:

using System.IO;
using System.Text;
using System.Text.RegularExpressions;

Encoding encoding = Encoding.GetEncoding("ISO-8859-1");

string mhtml = File.ReadAllText("./robot.mhtml", encoding);

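// "name" is the last path segment of Content-Location; "content" is the part body
// up to (but not including) the next boundary line.
MatchCollection matches = Regex.Matches(
    mhtml,
    @"Content-Location: .*/(?<name>.*)\r\n\r\n(?<content>(\n|.)+?)(?=\r\n------MultipartBoundary--)"
);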

Directory.CreateDirectory("out");

foreach (Match match in matches)
{
    File.WriteAllText("out/" + match.Groups["name"].Value, match.Groups["content"].Value, encoding);
}

I tested it, and it works.
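The reason the code above reads and writes with ISO-8859-1 can be checked in isolation. The following is a minimal Java sketch (Java because that is the language the port targets); the class and method names are hypothetical. It pushes the first bytes of a JPEG through a String and back, once with ISO-8859-1 and once with UTF-8, to show that only the former returns the original bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Decode bytes to a String and encode them back with ISO-8859-1.
    static byte[] viaLatin1(byte[] raw) {
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Same round trip through UTF-8, for comparison.
    static byte[] viaUtf8(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Start of a JPEG file: FF D8 FF E0 00 10 'J' 'F' 'I' 'F'
        byte[] jpegStart = {
            (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0x00, 0x10, 'J', 'F', 'I', 'F'
        };
        // ISO-8859-1 maps each byte to one character, so the bytes survive.
        System.out.println(Arrays.equals(jpegStart, viaLatin1(jpegStart))); // true
        // 0xFF is not valid UTF-8, so it is replaced during decoding and the bytes change.
        System.out.println(Arrays.equals(jpegStart, viaUtf8(jpegStart)));   // false
    }
}
```

This is also why Base64 decoding fails on such a part: with Content-Transfer-Encoding: binary the body is the raw image data, not Base64 text.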

Let me provide a complete explanation of the Regex for you:

The regex attempts to extract each part name (using the final part of the Content-Location header) and its content.
Without the Singleline flag, . matches every character except \n. Therefore, to match everything, including newlines, we use (.|\n).
Following the HTTP protocol, there is a single additional \r\n (an empty line) between the headers and the content.
(?<group_name>pattern) creates a regex group named group_name with the specified pattern, allowing us to retrieve only these specific parts of the complete match.
The +? quantifier is lazy: it does not extend the match greedily. With a plain +, the match would run until the last ------MultipartBoundary-- (resulting in only one file being extracted), whereas we want to stop at the first occurrence.
.+(?=sequence) matches up to the point where sequence follows; the lookahead (?=...) checks for sequence without consuming it.
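The effect of the lazy quantifier and the lookahead can be seen on a toy input. This is a small Java sketch, not the answer's code: the "Part:" header and the "--END--" delimiter are made-up stand-ins for Content-Location and the multipart boundary, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Two parts, each terminated by a line holding the stand-in delimiter.
    static final String INPUT =
        "Part: a\n\nfirst\nline\n--END--\nPart: b\n\nsecond\n--END--\n";

    // Lazy: stop at the first delimiter after each header.
    static final String LAZY =
        "Part: (?<name>.*)\\n\\n(?<content>(.|\\n)+?)(?=\\n--END--)";

    // Greedy: same pattern with a plain +, which runs to the last delimiter.
    static final String GREEDY =
        "Part: (?<name>.*)\\n\\n(?<content>(.|\\n)+)(?=\\n--END--)";

    // Collect the "content" group of every match.
    static List<String> contents(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group("content"));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(contents(LAZY, INPUT).size());   // 2 parts
        System.out.println(contents(GREEDY, INPUT).size()); // 1 part that swallows the second
    }
}
```

Because the lookahead does not consume the delimiter, the captured content ends right before it and the delimiter never ends up in an extracted part.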

Some other notes:

HTTP messages are encoded with ISO-8859-1, so you should read and write the files using this encoding. It maps every byte value to exactly one character, which keeps the binary parts intact.
I tested my solution on a file saved from the website you mentioned, downloaded using Chrome on Android.
To achieve the same result in Java, you should take into account the default flags and behaviors of Java's Regex. Nevertheless, I believe they are similar to those in C#.
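As a starting point, here is one possible Java sketch of the same approach. It is not a tested port of the C# code: the class name MhtmlParts is hypothetical, and it assumes CRLF line endings and a boundary starting with ------MultipartBoundary--, as in the C# pattern. One difference worth knowing is that Java's . excludes \r as well as \n by default, so this version uses the DOTALL flag for the content group instead of the (\n|.) alternation, and [^\r\n]* to keep the name on the header line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MhtmlParts {
    // name: text after the last '/' of Content-Location; content: lazy match up to
    // the next boundary line, which the lookahead leaves unconsumed.
    static final Pattern PART = Pattern.compile(
        "Content-Location: [^\\r\\n]*/(?<name>[^\\r\\n]*)\\r\\n\\r\\n"
            + "(?<content>.+?)(?=\\r\\n------MultipartBoundary--)",
        Pattern.DOTALL);

    // Map each part name to its raw bytes, in file order.
    static Map<String, byte[]> extract(String mhtml) {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        Matcher m = PART.matcher(mhtml);
        while (m.find()) {
            String name = m.group("name");
            if (name.isEmpty()) {
                continue; // added here: a URL ending in '/' gives no usable file name
            }
            parts.put(name, m.group("content").getBytes(StandardCharsets.ISO_8859_1));
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args.length > 0 ? args[0] : "robot.mhtml");
        // Decode with ISO-8859-1 so every byte becomes exactly one char.
        String mhtml = new String(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);
        Path outDir = Files.createDirectories(Paths.get("out"));
        for (Map.Entry<String, byte[]> e : extract(mhtml).entrySet()) {
            Files.write(outDir.resolve(e.getKey()), e.getValue());
        }
    }
}
```

Names taken from URLs may contain characters that are not valid in file names on some systems (for example a query string), so a real tool would probably sanitize them before writing.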

Besides writing code and logging, you can try out your customized regex in a .NET-specific regex tester to inspect the results and captured groups.
