xralf

Reputation: 3642

Coding of images in Blink archive

I have a Blink archive (in MHT format) saved from the Chrome browser. I'm trying to convert the section

Content-Type: image/jpeg
Content-Transfer-Encoding: binary
Content-Location: https://some_url

ÿØÿà^@^PJFIF^@^A^A^A^@`^@`^@^@ÿÛ^@C^@^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^AÿÛ^
^KÿÄ^@µ^P^@^B^A^C^C^B^D^C^E^E^D^D^@^@^A}^A^B^C^@

to an image file as follows:

string s = "\nÿØÿà^@^PJ...";
byte[] result = System.Convert.FromBase64String(s);
File.WriteAllBytes("image.jpg", result);

I get this error message: "The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters."

How can I fix it? There are probably \n characters in the string, but replacing \n with an empty string does not help.

Upvotes: 3

Views: 156

Answers (2)

Alireza Roshanzamir

Reputation: 1227

Because you mentioned that you want to implement your solution in Java, I developed a simple solution that can be easily converted to Java.

The following code reads the robot.mhtml file and dumps the content of each part to separate files in the out/ directory:

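using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// ISO-8859-1 maps each byte to exactly one char, so binary parts survive the round trip.
Encoding encoding = Encoding.GetEncoding("ISO-8859-1");

string mhtml = File.ReadAllText("./robot.mhtml", encoding);

// name: last path segment of Content-Location. [^\r\n]* is used instead of .*
// because . also matches \r, which would otherwise end up in the file name.
// content: everything after the blank line, up to the \r\n before the next boundary.
MatchCollection matches = Regex.Matches(
    mhtml,
    @"Content-Location: .*/(?<name>[^\r\n]*)\r\n\r\n(?<content>(\n|.)+?)(?=\r\n------MultipartBoundary--)"
);

Directory.CreateDirectory("out");

foreach (Match match in matches)
{
    // Writing with the same encoding turns each char back into its original byte.
    File.WriteAllText("out/" + match.Groups["name"].Value, match.Groups["content"].Value, encoding);
}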

I tested it, and it works.

Let me provide a complete explanation of the Regex for you:

  • The regex attempts to extract each part name (using the final part of the Content-Location header) and its content.
  • Without the Singleline flag, the . matches any character except \n. Therefore, to match everything, including new lines, we use (.|\n).
  • As in HTTP, an empty line (one additional \r\n) separates the headers of a part from its content.
  • The (?<group_name>pattern) creates a named group called group_name, which lets us read just that part of the complete match.
  • The +? makes the quantifier lazy, so it matches as little as possible. A plain + would capture content up to the last \n------MultipartBoundary-- (so only one file would be extracted), whereas we want to stop at the first occurrence.
  • The (?=sequence) in .+(?=sequence) is a lookahead: the match runs until sequence is found, without including sequence itself.
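The difference between the lazy and the greedy quantifier can be checked on a small in-memory sample. This is only a sketch: the two-part text, the example.test URLs, and the "demo" boundary suffix are made up, and the pattern here keeps \r out of the name group and anchors the lookahead on \r\n.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-part sample; real Chrome boundaries carry a random suffix.
string sample =
    "------MultipartBoundary--demo----\r\n" +
    "Content-Location: https://example.test/img/a.jpg\r\n" +
    "\r\n" +
    "AAA\nAAA\r\n" +
    "------MultipartBoundary--demo----\r\n" +
    "Content-Location: https://example.test/img/b.png\r\n" +
    "\r\n" +
    "BBB\r\n" +
    "------MultipartBoundary--demo------\r\n";

string head = @"Content-Location: .*/(?<name>[^\r\n]*)\r\n\r\n";
string tail = @"(?=\r\n------MultipartBoundary--)";

// Lazy +? stops at the first boundary that follows each part.
MatchCollection lazy = Regex.Matches(sample, head + @"(?<content>(\n|.)+?)" + tail);

// Greedy + runs on to the last boundary, so both parts collapse into one match.
MatchCollection greedy = Regex.Matches(sample, head + @"(?<content>(\n|.)+)" + tail);

foreach (Match m in lazy)
{
    Console.WriteLine($"{m.Groups["name"].Value}: {m.Groups["content"].Value.Length} chars");
}
Console.WriteLine($"lazy matches: {lazy.Count}, greedy matches: {greedy.Count}");
```

On this sample the lazy pattern should find both parts (a.jpg and b.png), while the greedy one returns a single match whose content swallows the second part's headers.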

Some other notes:

  • Read and write the files as ISO-8859-1. HTTP messages are encoded with it, and it maps every byte value to exactly one character, so the binary parts pass through unchanged.
  • This is a file on which I tested my solution. I visited your mentioned website and downloaded the page using Chrome on Android.
  • To achieve the same result in Java, you should take into account the default flags and behaviors of Java's Regex. Nevertheless, I believe they are similar to those in C#.
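The encoding note above can be verified with a short round-trip check. It is a sketch rather than part of the extraction code: it decodes all 256 byte values to a string and encodes them back, once with ISO-8859-1 and once with UTF-8 for comparison.

```csharp
using System;
using System.Linq;
using System.Text;

// Every possible byte value, 0x00 through 0xFF.
byte[] allBytes = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

// Decode to a string, then encode back with the same encoding.
byte[] viaLatin1 = latin1.GetBytes(latin1.GetString(allBytes));
byte[] viaUtf8 = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(allBytes));

bool latin1Lossless = allBytes.SequenceEqual(viaLatin1);
bool utf8Lossless = allBytes.SequenceEqual(viaUtf8);

Console.WriteLine($"ISO-8859-1 round trip lossless: {latin1Lossless}");
Console.WriteLine($"UTF-8 round trip lossless: {utf8Lossless}");
```

The UTF-8 round trip is lossy because bytes 0x80 to 0xFF on their own are not valid UTF-8 sequences and are replaced during decoding, which is why a JPEG body read as UTF-8 text cannot be written back intact.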

In addition to coding and logging, you can test your customized regex in this dotnet-specific regex tester to observe the results and captured groups.

Upvotes: 1

Ben Koch

Reputation: 101

From the snippet you've shared, it seems the data you have is not Base64 encoded, but instead directly represents the bytes of a JPEG file (as seen from the magic number ÿØÿà at the start, which corresponds to JPEG).

If this is the case, you don't need to perform a Base64 conversion at all; you need to convert this string to bytes directly.

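In C#, you can use the Encoding class to convert a string to bytes. Use an encoding that turns each character back into exactly one byte, such as ISO-8859-1, assuming the string was decoded with that same encoding. UTF-8 is not suitable here, because it encodes ÿ (U+00FF) as two bytes and would corrupt the image:

string s = "\nÿØÿà^@^PJ...";
// Drop the line break that precedes the image data; a JPEG must start with 0xFF 0xD8.
byte[] result = Encoding.GetEncoding("ISO-8859-1").GetBytes(s.TrimStart('\r', '\n'));
File.WriteAllBytes("image.jpg", result);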
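A different technique that sidesteps string encodings altogether is to stay at the byte level. The following is a minimal sketch, assuming the bytes of one part (headers plus body) are already in an array; IndexOf and ExtractBody are hypothetical helper names, and the sample part is made up.

```csharp
using System;
using System.Text;

// Hypothetical helper: index of the first occurrence of needle in haystack, or -1.
static int IndexOf(byte[] haystack, byte[] needle)
{
    for (int i = 0; i <= haystack.Length - needle.Length; i++)
    {
        int j = 0;
        while (j < needle.Length && haystack[i + j] == needle[j]) j++;
        if (j == needle.Length) return i;
    }
    return -1;
}

// Hypothetical helper: returns the bytes after the blank line that ends the headers.
static byte[] ExtractBody(byte[] data)
{
    byte[] separator = { 0x0D, 0x0A, 0x0D, 0x0A }; // \r\n\r\n
    int headerEnd = IndexOf(data, separator);
    if (headerEnd < 0) throw new FormatException("No blank line after the headers.");
    return data[(headerEnd + separator.Length)..];
}

// Made-up part: ASCII headers followed by the first four bytes of a JPEG.
byte[] headers = Encoding.ASCII.GetBytes(
    "Content-Type: image/jpeg\r\nContent-Transfer-Encoding: binary\r\n\r\n");
byte[] jpegStart = { 0xFF, 0xD8, 0xFF, 0xE0 };
byte[] part = new byte[headers.Length + jpegStart.Length];
headers.CopyTo(part, 0);
jpegStart.CopyTo(part, headers.Length);

byte[] body = ExtractBody(part);

// A JPEG starts with the marker 0xFF 0xD8 (shown as ÿØ when viewed as Latin-1 text).
bool looksLikeJpeg = body.Length >= 2 && body[0] == 0xFF && body[1] == 0xD8;
Console.WriteLine($"{body.Length} body bytes, JPEG marker present: {looksLikeJpeg}");
```

With a real archive, the input would come from File.ReadAllBytes and the body would go to File.WriteAllBytes; splitting the file into parts at the boundary lines is left out of this sketch.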

Upvotes: 1
