Dell

Reputation:

Base64 encoded string search

I have a string that is Base64 encoded. How can I check whether it contains a specific substring that is not encoded? I don't want to decode the string and then search it.

Can I just encode that specific substring and search the encoded string for the encoded substring?

Thanks,

Upvotes: 1

Views: 8251

Answers (6)

nnsk

Reputation: 133

You can convert both the plaintext and the Base64 string to their byte or hex representations, then search the bytes or hex as you would any other string. That way you don't have to bit-shift or do anything fancy, since there are no differences between the string encodings once they are converted to bytes.

A JavaScript implementation can be found here: https://github.com/nanaknihal/js-search-plaintext-within-base64
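As a rough illustration of the byte-level comparison in Python (not the linked JavaScript code), the sketch below turns the Base64 text into bytes and the plaintext into UTF-8 bytes before searching. The function name `contains_plaintext` and the UTF-8 assumption are mine. Note that turning the Base64 text into bytes is itself a decode step.

```python
import base64

def contains_plaintext(encoded: str, needle: str) -> bool:
    """Compare at the byte level: Base64 text -> bytes, needle -> UTF-8 bytes."""
    haystack = base64.b64decode(encoded)      # assumes standard, padded Base64
    return needle.encode("utf-8") in haystack

encoded = base64.b64encode(b"hello base64 world").decode("ascii")
print(contains_plaintext(encoded, "base64"))  # True
```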

Upvotes: -1

recvfrom

Reputation: 479

As others have noted, encoding the substring and using that directly to search can be challenging. Creating a regular expression from the substring, though, can make it a little bit easier.

To walk through an example, consider the use case of determining whether Base64-encoded data corresponds to a Windows executable. Some malware contains Base64-encoded EXEs that it will execute on infected systems, and it can be useful when doing malware analysis to detect this. A Windows executable can be identified by looking for MZ at the beginning of the data and PE\x00\x00 somewhere after that.

Base64 works by interpreting a stream of bytes as 6-bit values corresponding to, by default, the characters A through Z, a through z, 0 through 9, +, and /. The chart below shows these mappings:

[Image: Base64 character map]

To begin, first convert MZ to its binary equivalent (in this case we can assume the character set is ASCII, so M is 01001101 and Z is 01011010). Breaking these 16 bits into 6-bit groups yields:

010011 010101 1010xx
T      V      ???

Since 16 isn't evenly divisible by 6, the last two bits are dependent on the data that follows MZ. Since only two bits are missing, though, there are only four possible values for that character:

101000: o
101001: p
101010: q
101011: r

Thus, to look for MZ at the beginning of a Base64-encoded block of text, the regular expression ^TV[o-r] could be used.
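A quick way to check this in Python is to encode MZ followed by every possible next byte and collect the third Base64 character. This is only a sanity check of the reasoning above, using the standard alphabet.

```python
import base64
import re

MZ_PREFIX = re.compile(r"^TV[o-r]")

# The third character depends on the top two bits of the byte after "MZ",
# so try every possible following byte.
third_chars = set()
for b in range(256):
    encoded = base64.b64encode(b"MZ" + bytes([b])).decode("ascii")
    assert MZ_PREFIX.match(encoded)
    third_chars.add(encoded[2])

print(sorted(third_chars))  # ['o', 'p', 'q', 'r']
```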

Looking for PE\x00\x00 is more challenging because we don't know how many characters appear before it. There are three distinct cases to consider based on how sets of 8 bits divide into the sets of 6 bits to form the Base64 output:

  • Zero preceding bytes: No bits from the prior bytes affect the first character (0 % 6 == 0)
  • One preceding byte: Two bits from the prior byte affect the first character (8 % 6 == 2; the last two bits spill over)
  • Two preceding bytes: Four bits from the prior byte affect the first character (16 % 6 == 4; the last four bits spill over)
  • Three preceding bytes: No bits from the prior bytes affect the first character (24 % 6 == 0)
  • ... and so on

Working through the three cases:

Zero-bit shift:
P        E        \x00     \x00
01010000 01000101 00000000 00000000
becomes:
010100 000100 010100 000000 000000 00xxxx
U      E      U      A      A      [A-P]

Two-bit shift:
???      P        E        \x00     \x00
xxxxxxxx 01010000 01000101 00000000 00000000
becomes:
xxxxxx xx0101 000001 000101 000000 000000 0000xx
       [FVl1] B      F      A      A      [A-D]

Four-bit shift:
???      ???      P        E        \x00     \x00
xxxxxxxx xxxxxxxx 01010000 01000101 00000000 00000000
becomes:
xxxxxx xxxxxx xxxx01             010000 010001 010000 000000 000000
              [BFJNRVZdhlptx159] Q      R      Q      A      A
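The three tables above can be reproduced mechanically. The sketch below is one possible way to do it, assuming the standard alphabet: it writes the bytes as a bit string with x for unknown neighbouring bits, cuts it into 6-bit groups, and lists every character compatible with the known bits. The helper names `alignment_columns` and `to_regex` are made up for this example.

```python
import re

ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def alignment_columns(needle, offset):
    """Possible Base64 characters at each position overlapping `needle`
    when it starts `offset` (0, 1 or 2) bytes into a 3-byte group."""
    bits = "x" * (8 * offset) + "".join(f"{byte:08b}" for byte in needle)
    bits += "x" * (-len(bits) % 6)       # unknown bits that follow the needle
    columns = []
    for i in range(0, len(bits), 6):
        group = bits[i:i + 6]
        if set(group) == {"x"}:
            continue                     # group holds no needle bits at all
        columns.append("".join(
            ch for value, ch in enumerate(ALPHABET)
            if all(k in ("x", bit) for k, bit in zip(group, f"{value:06b}"))
        ))
    return columns

def to_regex(columns):
    # Single candidates become literals, several become a character class.
    return "".join(
        re.escape(c) if len(c) == 1 else "[" + c + "]" for c in columns
    )

for offset in range(3):
    print(offset, to_regex(alignment_columns(b"PE\x00\x00", offset)))
# 0 UEUAA[ABCDEFGHIJKLMNOP]
# 1 [FVl1]BFAA[ABCD]
# 2 [BFJNRVZdhlptx159]QRQAA
```

The character classes are spelled out in full rather than as ranges such as [A-P], but they match the same characters.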

So altogether, you could use the following regex to determine whether Base64-encoded data is a Windows Executable:

^TV[o-r][A-Za-z0-9\+/]+(?:UEUAA[A-P]|[FVl1]BFAA[A-D]|[BFJNRVZdhlptx159]QRQAA)

Note that in valid executables the space between the DOS header and PE header is likely constrained, so [A-Za-z0-9\+/]+ could be replaced with a smaller bound.
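Here is a small Python sketch that applies the regex to synthetic data. The blobs are made-up byte strings (MZ, some filler, then PE\x00\x00), not real executables, and `looks_like_exe` is a name chosen for this example. Since the regex does not tie each alternative to its alignment, a match is best treated as a hint to be confirmed by decoding.

```python
import base64
import re

EXE_RE = re.compile(
    r"^TV[o-r][A-Za-z0-9\+/]+"
    r"(?:UEUAA[A-P]|[FVl1]BFAA[A-D]|[BFJNRVZdhlptx159]QRQAA)"
)

def looks_like_exe(encoded: str) -> bool:
    return EXE_RE.search(encoded) is not None

# Vary the gap so that "PE\0\0" lands on each of the three alignments.
for gap in range(1, 10):
    blob = b"MZ" + b"\x90" * gap + b"PE\x00\x00" + b"\x4c\x01"
    assert looks_like_exe(base64.b64encode(blob).decode("ascii"))

# "MZ" followed only by filler has no PE signature.
no_pe = base64.b64encode(b"MZ" + b"\x90" * 64).decode("ascii")
print(looks_like_exe(no_pe))  # False
```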

Base64 character table from: https://en.wikipedia.org/wiki/Base64#Base64_table

Upvotes: 1

Jon Skeet

Reputation: 1502396

Assuming you know the exact form of base64 encoding involved, you could encode your string as if it occurred at each of the three offsets (start%3 == 0, start%3 == 1, start%3 == 2). You'd have to be cunning around the start and end of the string, as those characters will be affected by the surrounding data. You could then just use a normal IndexOf or whatever to check the middle part of the string, and then check the start and end more smartly.
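To see why the three offsets matter, here is a small Python demonstration, assuming the standard alphabet. The substring is encoded once on its own and then searched for naively in three encoded strings where it starts at byte offsets 0, 1 and 2. The sample bytes are arbitrary.

```python
import base64

needle = b"needle-in-haystack"          # 18 bytes, so its encoding has no '='
naive = base64.b64encode(needle).decode("ascii")

results = {}
for offset in range(3):
    prefix = b"xyz"[:offset]             # 0, 1 or 2 filler bytes in front
    suffix = b"xyz"[offset:]             # remaining filler goes behind
    haystack = base64.b64encode(prefix + needle + suffix).decode("ascii")
    results[offset] = naive in haystack

print(results)  # {0: True, 1: False, 2: False}
```

Only the aligned case is found by a plain substring search, which is why the other two offsets need their own encodings.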

Personally I wouldn't go to all of this trouble though - as the other suggestions recommend, just decode and then search. It's going to be much easier to get right.

Upvotes: 2

mweerden

Reputation: 14051

The best way is probably to just decode the string. However, if really necessary, it is possible to do this on the fly instead of doing a full decode followed by a search. You'll have to implement your own search and decode only the part that you are currently inspecting. This is most likely only useful if you have very big strings that you really do not want to (or cannot) store twice in memory.

If the string you search for is long enough, you can also encode that string three times with different padding (e.g. '', 'x' and 'xx') and search for those without the first 4 and last 4 characters (you don't want to match the padding). When you find a match, you have to make sure the alignment corresponds with the padding and verify that the parts that you didn't match yet (due to the padding) are also in place. The latter does require some decoding, of course.
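A minimal Python sketch of the on-the-fly idea follows. It decodes a fixed number of Base64 characters at a time and carries over just enough bytes to catch a match that spans a chunk boundary. The name `stream_contains` and the chunk size are my own choices, and the sketch assumes padded Base64 with no line breaks or whitespace.

```python
import base64

def stream_contains(encoded: str, needle: bytes, chunk_chars: int = 4096) -> bool:
    """Search the decoded bytes chunk by chunk, never holding the full decode."""
    if chunk_chars % 4 != 0:
        raise ValueError("chunk_chars must be a multiple of 4")
    if not needle:
        return True
    tail = b""                                   # overlap carried between chunks
    for i in range(0, len(encoded), chunk_chars):
        window = tail + base64.b64decode(encoded[i:i + chunk_chars])
        if needle in window:
            return True
        # Keep len(needle) - 1 bytes so a boundary-spanning match is not lost.
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""
    return False

data = bytes(range(256)) * 10
encoded = base64.b64encode(data).decode("ascii")
print(stream_contains(encoded, data[1000:1010], chunk_chars=8))  # True
```

Here `encoded` is still a single in-memory string; with a file or socket the same loop would read `chunk_chars` characters at a time instead of slicing.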
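The padding approach could look roughly like this in Python. It is a sketch under some assumptions: standard padded Base64 without line breaks, and a search string of at least 7 bytes so that something remains after trimming. The function name `find_in_base64` is invented for the example, and it returns the byte offset of one occurrence, not necessarily the first.

```python
import base64

def find_in_base64(encoded: str, needle: bytes) -> int:
    """Return a byte offset where needle occurs in the decoded data, or -1."""
    for pad in range(3):
        candidate = base64.b64encode(b"x" * pad + needle).decode("ascii")
        core = candidate[4:-4]               # drop chars touched by neighbours
        if not core:
            raise ValueError("needle too short for this approach")
        pos = encoded.find(core)
        while pos != -1:
            if pos % 4 == 0 and pos >= 4:    # alignment must match the padding
                start = pos - 4
                # Decode only the few characters around the hit to verify
                # the parts of the needle that the core did not cover.
                block = base64.b64decode(encoded[start:start + len(candidate)])
                if block[pad:pad + len(needle)] == needle:
                    return start // 4 * 3 + pad
            pos = encoded.find(core, pos + 1)
    return -1

data = b"The quick brown fox jumps over the lazy dog"
encoded = base64.b64encode(data).decode("ascii")
print(find_in_base64(encoded, b"lazy dog"))  # 35
```

The local decode of a dozen or so characters is the "some decoding" mentioned above. The bulk of the encoded string is only ever scanned with an ordinary substring search.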

Upvotes: 8

Ken Paul

Reputation: 5765

You can't just search for an encoded substring. Your search string will be encoded differently depending on where in the original string it appears. I think you will need to decode the entire string and then search for your substring.

Upvotes: -1

Kris Kumler

Reputation: 6307

Base64 can take several different forms, with differing algorithms or implementations. Even in the examples on Wikipedia, one can see that the encoded values of characters change depending on their position. Short answer: no, you can't encode just the string and search for it in the larger encoded text.

Upvotes: 0
