Nick Russler
Nick Russler

Reputation: 4688

Optimize byte array simple pattern matching

For an excercisement i had to look for a certain byte pattern within an byte array, easy enough but i wonder if the code can be simplified or even be optimized:

package anti_virus;

import java.nio.file.Files;
import java.nio.file.Paths;

public class Main {

    public static void main(String[] args) throws Exception {
        byte[] virus = Files.readAllBytes(Paths.get("C:/Users/Nick/Desktop/Uni/infected.com"));

        byte[] payload = new byte[]{0x56, 0x69, 0x72, 0x75, 0x73, (byte)0xB4, 0x40, (byte) 0xBB, 0x01,
                0x00, (byte) 0xB9, 0x05, 0x00, (byte) 0xBA, 0x0, 0x0, (byte) 0xCD, 0x21};

        // payload[14] and payload[14] have varying values

        for (int i = 0; i < virus.length; i++) {
            if ((virus[i] == payload[0]) && (virus[i+1] == payload[1]) && (virus[i+2] == payload[2]) &&
                (virus[i+3] == payload[3]) && (virus[i+4] == payload[4]) && (virus[i+5] == payload[5]) &&
                (virus[i+6] == payload[6]) && (virus[i+7] == payload[7]) && (virus[i+8] == payload[8]) &&
                (virus[i+9] == payload[9]) && (virus[i+10] == payload[10]) && (virus[i+11] == payload[11]) &&
                (virus[i+12] == payload[12]) && (virus[i+13] == payload[13]) && (virus[i+16] == payload[16]) &&
                (virus[i+17] == payload[17])) {
                  System.out.println("This file is probably a Virus!");
                  return;
            }
        }

        System.out.println("This file is no Virus.");
    }
}

Upvotes: 1

Views: 1613

Answers (4)

TrapII
TrapII

Reputation: 2275

Your code is quite good, It gives a reasonable 21 ms on a non-virus 6 MB file. But I found it better to either make some pre-loop for the first 14 bytes. Moreover, you have to take care at the ending bytes.

begin = System.currentTimeMillis();
for (i = 0; i < virus.length-payload.length; i++) {
    for (j = 0; j < 14; j++) {
        // payload[14] and payload[15] have varying values
        if (virus[i+j] != payload[j]) {
            bFound = false;
            break;
        }
    }
    if ((bFound) && (virus[i+16] == payload[16]) && (virus[i+17] == payload[17])) {
        end = System.currentTimeMillis();
        System.out.println("time : "+(end-begin)+" ms");
        System.out.println("This file is probably a Virus!");
        return;
    }
}
end = System.currentTimeMillis();
System.out.println("time : "+(end-begin)+" ms");
System.out.println("This file is not a Virus.");

This first optim give a a reasonnable 14 ms (-33% of CPU).

Another optimization if you can afford reading your file as integer, is to make wide comparison (4 bytes) at a time. You should pad also the payload to a multiple of 4.

begin = System.currentTimeMillis();
for (i = 0; i < virusInt.length-payloadInt.length; i++) {
    if ((virusInt[i] == payloadInt[0]) && 
        (virusInt[i+1] == payloadInt[1]) && 
        (virusInt[i+2] == payloadInt[2]) &&
        ((virusInt[i+3]&0xFFFF0000) == payloadInt[3]) && 
        ((virusInt[i+4]&0xFFFF0000) == payloadInt[4])) {
           end = System.currentTimeMillis();
           System.out.println("time : "+(end-begin)+" ms");
           System.out.println("This file is probably a Virus!");
           return;
       }
}
end = System.currentTimeMillis();
System.out.println("time : "+(end-begin)+" ms");
System.out.println("This file is not a Virus.");

This give me an even more reasonable 2 ms (-90% of CPU). Of course, I do not count the time to transform into array of int as I suppose you load as array of int and your payload is an array of int too. I have not tried with long (which are 64 bits in JAVA) but it could be a little bit faster.

Upvotes: 1

fabian
fabian

Reputation: 82461

Yes, it can be simplified/optimized:

  • You could use the KMP algorithm (first 14 bytes). This algorithm runs in O(payload.length + virus.length) for arbitrary payload instead of O(payload.length * virus.length). (Your code is working more efficiently than O(payload.length * virus.length) for only one reason: 0x56 occurs only as the first element of your array)
  • Even if you choose to keep your algorithm, I'd use a loop to make the code shorter & more readable. I'd also fix the source of ArrayIndexOutOfBoundsException in your loop (you may access indices i, ..., i+13, i+16, i+17 of the virus array and your loop condition allows i to get as large as virus.length-1).

Upvotes: 3

epoch
epoch

Reputation: 16605

Something like this would check for the signature anywhere within the array, it has not been thoroughly tested though

public static void main(String[] args) throws Exception {
    byte[] virus = FileUtil.readBytes(new File("c:/x.txt"));
    byte[] payload = "def".getBytes();

    for (int i = 0; i < virus.length; i++) {
        if ((i + payload.length) <= virus.length) {
            boolean found = true;
            for (int j = 0; j < payload.length; j++) {
                if (virus[i + j] != payload[j]) {
                    found = false;
                    break;
                }
            }

            if (found) {
                System.out.println("This file is probably a Virus!");
                return;
            }
        } else {
            break;
        }
    }

    System.out.println("This file is no Virus.");
}

Upvotes: 0

Joop Eggen
Joop Eggen

Reputation: 109547

(Here I assume that virus is the virus signature, and payload any data. I might be wrong seeing your code.)

One has to walk the payload array for paylöadIndex in [0, payload.length - virus.length] (!) and at every step check the virus array again, in a for-loop, with a virusIndex.

Problem solution strategy. Think how you would do it on paper. You would shift the virus array over the payload array.

Upvotes: 0

Related Questions