Reputation:
I have a string that is Base64-encoded. How can I check whether it contains a specific substring that is not encoded? I don't want to decode the string and then search it.
Can I just encode the substring and search the encoded string for the encoded substring?
Thanks,
Upvotes: 1
Views: 8251
Reputation: 133
You can convert both the plaintext and the base64 string to their bytes or hex representation. Then you can search the bytes or hex as you would with any other string. Then you don't have to bitshift or anything fancy like that, as there are no differences between the string encodings once they are converted to bytes.
A JavaScript implementation can be found here: https://github.com/nanaknihal/js-search-plaintext-within-base64
Upvotes: -1
Reputation: 479
As others have noted, encoding the substring and using that directly to search can be challenging. Creating a regular expression from the substring, though, can make it a little bit easier.
To walk through an example, consider the use case of determining whether Base64-encoded data corresponds to a Windows executable. Some malware contains Base64-encoded EXEs that it will execute on infected systems, and it can be useful when doing malware analysis to detect this. A Windows executable can be identified by looking for MZ at the beginning of the data and PE\x00\x00 somewhere after that.
Base64 works by interpreting a stream of bytes as 6-bit values corresponding to, by default, the characters A through Z, a through z, 0 through 9, +, and /. The Base64 table referenced at the end of this answer shows the full mapping.
To begin, first convert MZ to its binary equivalent (in this case we can assume the character set is ASCII, so M is 01001101 and Z is 01011010). Breaking these 16 bits into 6-bit groups yields:
010011 010101 1010xx
T V ???
Since 16 isn't evenly divisible by 6, the last two bits depend on the data that follows MZ. Since only two bits are missing, though, there are only four possible values for that character:
101000: o
101001: p
101010: q
101011: r
Thus, to look for MZ at the beginning of a Base64-encoded block of text, the regular expression ^TV[o-r] could be used.
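As a quick sanity check of that pattern, here is a minimal Python sketch. The helper name starts_with_mz is made up for illustration, and the check assumes the standard Base64 alphabet with no leading whitespace in the encoded text.

```python
import base64
import re

# Pattern derived above: "MZ" followed by any third byte.
MZ_PREFIX = re.compile(r"^TV[o-r]")

def starts_with_mz(encoded: str) -> bool:
    """Return True if the Base64 text begins with an encoding of b'MZ'."""
    return MZ_PREFIX.match(encoded) is not None

# Encode b"MZ" followed by every possible third byte and collect the
# first three Base64 characters of each result.
prefixes = {
    base64.b64encode(b"MZ" + bytes([b])).decode("ascii")[:3]
    for b in range(256)
}
```

Only four distinct prefixes come out of the loop, one for each value of the two undetermined bits.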
Looking for PE\x00\x00 is more challenging because we don't know how many bytes appear before it. There are three distinct cases to consider, based on how sets of 8 bits divide into the sets of 6 bits that form the Base64 output:
No preceding bytes (0 % 6 == 0)
One preceding byte (8 % 6 == 2; the last two bits spill over)
Two preceding bytes (16 % 6 == 4; the last four bits spill over)
Three preceding bytes (24 % 6 == 0), which aligns the same way as the first case
Working through the three cases:
Zero-bit shift:
P E \x00 \x00
01010000 01000101 00000000 00000000
becomes:
010100 000100 010100 000000 000000 00xxxx
U E U A A [A-P]
Two-bit shift:
??? P E \x00 \x00
xxxxxxxx 01010000 01000101 00000000 00000000
becomes:
xxxxxx xx0101 000001 000101 000000 000000 0000xx
[FVl1] B F A A [A-D]
Four-bit shift:
??? ??? P E \x00 \x00
xxxxxxxx xxxxxxxx 01010000 01000101 00000000 00000000
becomes:
xxxxxx xxxxxx xxxx01 010000 010001 010000 000000 000000
[BFJNRVZdhlptx159] Q R Q A A
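The three tables can be reproduced mechanically. The following Python sketch is only an illustration: sextets and candidates are hypothetical helper names, unknown bits are written as x, and the standard Base64 alphabet is assumed.

```python
import itertools
import string

# Standard Base64 alphabet (index 0 is "A", index 63 is "/").
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def sextets(known: bytes, unknown_before: int) -> list[str]:
    """Split the bit stream into 6-bit groups, marking unknown bits as 'x'."""
    bits = "x" * (8 * unknown_before) + "".join(f"{b:08b}" for b in known)
    bits += "x" * (-len(bits) % 6)  # the final group may depend on later data
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]

def candidates(group: str) -> str:
    """List every Base64 character a group could be, trying all 'x' fills."""
    chars = []
    for fill in itertools.product("01", repeat=group.count("x")):
        fill_bits = iter(fill)
        bits = "".join(next(fill_bits) if c == "x" else c for c in group)
        chars.append(ALPHABET[int(bits, 2)])
    return "".join(chars)

# One entry per case: zero, one, and two unknown bytes before the marker.
cases = {k: sextets(b"PE\x00\x00", k) for k in range(3)}
```

Passing each partially known group to candidates gives the character classes used in the patterns, for example candidates("xx0101") for the leading class of the two-bit shift.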
So altogether, you could use the following regex to determine whether Base64-encoded data is a Windows Executable:
^TV[o-r][A-Za-z0-9\+/]+(?:UEUAA[A-P]|[FVl1]BFAA[A-D]|[BFJNRVZdhlptx159]QRQAA)
Note that in valid executables the space between the DOS header and PE header is likely constrained, so [A-Za-z0-9\+/]+ could be replaced with a smaller bound.
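To try the regex out, here is a small Python sketch that runs it against synthetic payloads. The payloads are made-up byte strings rather than real executables, and looks_like_pe is a hypothetical helper name. Because the pattern does not check where in a 4-character group the PE match starts, it can produce false positives, so a hit is best confirmed by decoding.

```python
import base64
import re

# The regex from above, copied verbatim.
PE_IN_BASE64 = re.compile(
    r"^TV[o-r][A-Za-z0-9\+/]+"
    r"(?:UEUAA[A-P]|[FVl1]BFAA[A-D]|[BFJNRVZdhlptx159]QRQAA)"
)

def looks_like_pe(encoded: str) -> bool:
    """Heuristic check: MZ at the start and PE\\x00\\x00 somewhere later."""
    return PE_IN_BASE64.match(encoded) is not None

def synthetic_payload(filler: int) -> str:
    """Build fake data: MZ, some filler bytes, then a PE signature."""
    raw = b"MZ" + b"\x90" * filler + b"PE\x00\x00" + b"\x4c\x01"
    return base64.b64encode(raw).decode("ascii")

# Filler lengths 1, 2 and 3 put the PE signature at each of the three
# possible alignments.
results = [looks_like_pe(synthetic_payload(n)) for n in (1, 2, 3)]
```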
Base64 character table from: https://en.wikipedia.org/wiki/Base64#Base64_table
Upvotes: 1
Reputation: 1502396
Assuming you know the exact form of base64 encoding involved, you could encode your string as if it occurred at each of the three offsets (start%3 == 0, start%3 == 1, start%3 == 2). You'd have to be cunning around the start and end of the string, as those characters will be affected by the surrounding data. You could then just use a normal IndexOf or whatever to check the middle part of the string, and then check the start and end more smartly.
Personally I wouldn't go to all of this trouble though - as the other suggestions recommend, just decode and then search. It's going to be much easier to get right.
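For comparison, the decode-then-search route is only a few lines. This is a minimal Python sketch; find_decoded is a hypothetical name, and the input is assumed to be standard padded Base64.

```python
import base64

def find_decoded(encoded: str, needle: bytes) -> int:
    """Decode the whole string, then return the byte offset of needle or -1."""
    decoded = base64.b64decode(encoded, validate=True)
    return decoded.find(needle)

encoded = base64.b64encode(b"hello world").decode("ascii")
offset = find_decoded(encoded, b"lo wo")
```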
Upvotes: 2
Reputation: 14051
The best way is probably to just decode the string. However, if really necessary, it is possible to do this on the fly instead of doing a full decode followed by a search. You'll have to implement your own search and decode only the part that you are currently inspecting. This is most likely only useful if you have very big strings that you really do not want to (or cannot) store twice in memory.
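One way to sketch the on-the-fly idea in Python is to decode a fixed number of Base64 characters at a time and keep a small overlap between chunks. The function name find_streaming and the default chunk size are illustrative choices, and the sketch assumes standard padded Base64 with no embedded whitespace.

```python
import base64

def find_streaming(encoded: str, needle: bytes, chunk_chars: int = 4096) -> int:
    """Return the decoded byte offset of needle, or -1, decoding chunk by chunk."""
    if chunk_chars % 4 != 0:
        raise ValueError("chunk_chars must be a multiple of 4")
    if not needle:
        return 0
    keep = len(needle) - 1  # bytes carried over so matches can span chunks
    tail = b""              # last `keep` decoded bytes seen so far
    consumed = 0            # decoded offset at which `tail` starts
    for start in range(0, len(encoded), chunk_chars):
        # Every chunk starts on a 4-character boundary, so it decodes alone.
        window = tail + base64.b64decode(encoded[start:start + chunk_chars])
        hit = window.find(needle)
        if hit != -1:
            return consumed + hit
        new_tail = window[-keep:] if keep else b""
        consumed += len(window) - len(new_tail)
        tail = new_tail
    return -1
```

Only one chunk plus the short overlap is held in decoded form at any time.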
If the string you search for is long enough, you can also encode that string three times with different padding (e.g. '', 'x' and 'xx') and search for those without the first 4 and last 4 characters (you don't want to match the padding). When you find a match, you have to make sure the alignment corresponds with the padding and verify that the parts that you didn't match yet (due to the padding) are also in place. The latter does require some decoding, of course.
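A rough Python sketch of this approach follows. It departs slightly from the description above: instead of dropping a fixed 4 characters at each end, it trims only the characters that depend on neighbouring data. The names SKIP, core_for_offset and base64_contains are hypothetical, and the input is assumed to be standard padded Base64 without line breaks.

```python
import base64

# Leading Base64 characters that depend on the bytes before the needle,
# keyed by the needle's byte offset modulo 3.
SKIP = {0: 0, 1: 2, 2: 3}

def core_for_offset(needle: bytes, k: int) -> str:
    """Encode needle as if k bytes preceded it; keep only the stable characters."""
    enc = base64.b64encode(b"\x00" * k + needle).decode("ascii").rstrip("=")
    if (k + len(needle)) % 3 != 0:
        enc = enc[:-1]  # the last character depends on the data that follows
    return enc[SKIP[k]:]

def base64_contains(encoded: str, needle: bytes) -> bool:
    """Search the encoded text directly, decoding only small candidate windows."""
    for k in range(3):
        core = core_for_offset(needle, k)
        groups = -(-(k + len(needle)) // 3)  # 4-char groups covering the needle
        pos = encoded.find(core)
        while pos != -1:
            if pos % 4 == SKIP[k]:  # alignment must agree with the offset
                start = pos - SKIP[k]
                window = base64.b64decode(encoded[start:start + 4 * groups])
                if window[k:k + len(needle)] == needle:
                    return True
            pos = encoded.find(core, pos + 1)
    return False
```

Very short needles leave little or nothing in the stable core, so the shortcut only pays off when the search string is reasonably long.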
Upvotes: 8
Reputation: 5765
You can't just search for an encoded substring. Your search string will be encoded differently depending on where in the original string it appears. I think you will need to decode the entire string and then search for your substring.
Upvotes: -1
Reputation: 6307
Base64 can take several different forms, with differing alphabets or implementations. Even in the examples on Wikipedia, one can see that the encoded form of a character changes depending on its position. Short answer: no, you can't just encode the substring and search for it in the larger encoded text.
Upvotes: 0