Reputation: 3642
I have a Blink archive (in MHT format) saved from the Chrome browser. I'm trying to convert the section
Content-Type: image/jpeg
Content-Transfer-Encoding: binary
Content-Location: https://some_url
ÿØÿà^@^PJFIF^@^A^A^A^@`^@`^@^@ÿÛ^@C^@^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^AÿÛ^
^KÿÄ^@µ^P^@^B^A^C^C^B^D^C^E^E^D^D^@^@^A}^A^B^C^@
to an image file as follows:
string s = "\nÿØÿà^@^PJ...";
byte[] result = System.Convert.FromBase64String(s);
File.WriteAllBytes("image.jpg", result);
I get the error message "The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters."
How can I fix it? There are probably \n characters in the string, but replacing \n with an empty string does not help.
Upvotes: 3
Views: 156
Reputation: 1227
Because you mentioned that you want to implement your solution in Java, I developed a simple solution that can be easily converted to Java.
The following code reads the robot.mhtml file and dumps the content of each part to separate files in the out/ directory:
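using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// ISO-8859-1 maps every byte to one character, so binary parts survive the round trip.
Encoding encoding = Encoding.GetEncoding("ISO-8859-1");
string mhtml = File.ReadAllText("./robot.mhtml", encoding);

// name: last path segment of the Content-Location URL; content: everything up to the next boundary.
// \r?\n accepts both CRLF and LF line endings and keeps the trailing \r out of the captured groups.
MatchCollection matches = Regex.Matches(
    mhtml,
    @"Content-Location: .*/(?<name>[^\r\n]*)\r?\n\r?\n(?<content>(\n|.)+?)(?=\r?\n------MultipartBoundary--)"
);

Directory.CreateDirectory("out");
foreach (Match match in matches)
{
    // Write each part with the same encoding it was read with.
    File.WriteAllText("out/" + match.Groups["name"].Value, match.Groups["content"].Value, encoding);
}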
I tested it, and it works.
Let me provide a complete explanation of the regex:
Each part of the file consists of some headers (the last one being the Content-Location header) and its content.
In a regex, . matches every character except \n. Therefore, to match everything, including new lines, we use (.|\n).
A blank line (\r\n) separates the headers from the content.
The (?<group_name>pattern) syntax creates a regex group named group_name that matches pattern, so we can ask each match to return only these specific parts of the complete match.
The +? quantifier is lazy, so it does not extend the match greedily. With a plain +, the content is captured up to the last \n------MultipartBoundary-- (resulting in only one file being extracted), whereas we want to capture content only up to the first occurrence.
The .+(?=sequence) construct is a lookahead: it matches up to the point where sequence begins, without consuming sequence itself.
Some other notes:
The binary parts are stored as raw bytes, and ISO-8859-1 maps each byte to exactly one character, so you should read and write files using this encoding.
In addition to coding and logging, you can test your customized regex in a .NET-specific regex tester to observe the results and captured groups.
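To see the difference between the lazy and the greedy quantifier without a real archive, here is a minimal sketch. It assumes a made-up two-part sample (the example.test URLs and the part contents are hypothetical) and a variant of the pattern that accepts both \r\n and \n line endings:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-part sample; a real MHTML file has more headers and binary content.
string sample =
    "Content-Location: https://example.test/a.txt\r\n\r\nfirst" +
    "\r\n------MultipartBoundary--x\r\n" +
    "Content-Location: https://example.test/b.txt\r\n\r\nsecond" +
    "\r\n------MultipartBoundary--x--\r\n";

// The pattern is split so that only the quantifier on the content group changes.
string head = @"Content-Location: .*/(?<name>[^\r\n]*)\r?\n\r?\n(?<content>(\n|.)";
string tail = @")(?=\r?\n------MultipartBoundary--)";

// Lazy: each match stops at the first boundary that follows it.
MatchCollection lazy = Regex.Matches(sample, head + "+?" + tail);
// Greedy: the first match runs on to the last boundary and swallows the second part.
MatchCollection greedy = Regex.Matches(sample, head + "+" + tail);

int lazyCount = lazy.Count;
int greedyCount = greedy.Count;
string firstName = lazy[0].Groups["name"].Value;
string firstContent = lazy[0].Groups["content"].Value;

Console.WriteLine($"lazy: {lazyCount}, greedy: {greedyCount}"); // lazy: 2, greedy: 1
Console.WriteLine($"{firstName} -> {firstContent}");            // a.txt -> first
```

With the lazy quantifier each part becomes its own match, which is what the extraction loop relies on.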
Upvotes: 1
Reputation: 101
From the snippet you've shared, it seems the data you have is not Base64 encoded, but instead directly represents the bytes of a JPEG file (as seen from the magic number ÿØÿà at the start, which corresponds to JPEG).
If this is the case, you don't need to perform a Base64 conversion at all; you need to convert this string to bytes directly.
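In C#, you can use the Encoding class to convert a string to bytes. Each character here stands for a single byte (ÿ is 0xFF, Ø is 0xD8), so use a single-byte encoding such as ISO-8859-1; UTF-8 would encode ÿ as two bytes and corrupt the image:
string s = "\nÿØÿà^@^PJ...";
byte[] result = Encoding.GetEncoding("ISO-8859-1").GetBytes(s);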
File.WriteAllBytes("image.jpg", result);
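A quick way to check which encoding preserves the bytes is to round-trip the four JPEG magic bytes through a string. This is only a sketch: it assumes the text was decoded as ISO-8859-1 in the first place, and the variable names are illustrative:

```csharp
using System;
using System.Text;

// First four bytes of a JPEG/JFIF file (shown as "ÿØÿà" in the question).
byte[] magic = { 0xFF, 0xD8, 0xFF, 0xE0 };

Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

// ISO-8859-1 maps each byte to the character with the same code point, and back.
string asText = latin1.GetString(magic);
byte[] viaLatin1 = latin1.GetBytes(asText);

// UTF-8 encodes every character above U+007F as two or more bytes.
byte[] viaUtf8 = Encoding.UTF8.GetBytes(asText);

Console.WriteLine(asText == "\u00FF\u00D8\u00FF\u00E0"); // True
Console.WriteLine(BitConverter.ToString(viaLatin1));      // FF-D8-FF-E0
Console.WriteLine(BitConverter.ToString(viaUtf8));        // C3-BF-C3-98-C3-BF-C3-A0
```

Only the ISO-8859-1 round trip returns the original four bytes, so the encoding used for writing has to be the single-byte one used for reading.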
Upvotes: 1