ClaireG
ClaireG

Reputation: 1244

Techniques for static code analysis in detecting integer overflows

I'm trying to find some effective techniques which I can base my integer-overflow detection tool on. I know there are many ready-made detection tools out there, but I'm trying to implement a simple one on my own, both for my personal interest in this area and also for my knowledge.

I know techniques like Pattern Matching and Type Inference, but I read that more complicated code analysis techniques are required to detect the int overflows. There's also the Taint Analysis which can "flag" un-trusted sources of data.

Is there some other technique, which I might not be aware of, which is capable of detecting integer overflows?

Upvotes: 3

Views: 2480

Answers (4)

Ira Baxter
Ira Baxter

Reputation: 95334

I'd expect Frama-C to provide such a capability. Frama-C is focused on C source code, but I don't know if it is dialect-sensitive or specific. I believe it uses abstract interpretation to model values. I don't know if it specifically checks for overflows.

Our DMS Software Reengineering Toolkit has variety of langauge front ends, including most major dialects of C. It provides control and data flow analysis, and also abstract interpretation for computing ranges, as foundations on which you can build an answer. My Google Tech Talk on DMS at about 0:28:30 specifically talks about how one can use DMS's abstract interpretation on value ranges to detect overflow (of an index on a buffer). A variation on checking the upper bound on array sizes is simply to check for values not exceeding 2^N. However, off the shelf DMS does not provide any specific overflow analysis for C code. There's room for the OP to do interesting work :=}

Upvotes: 0

user555045
user555045

Reputation: 64904

It seems you are looking for some sort of Value Range Analysis, and detect when that range would exceed the set bounds. This is something that on the face of it seems simple, but is actually hard. There will be lots of false positives, and that's even without counting bugs in the implementation.

To ignore the details for a moment, you associate a pair [lower bound, upper bound] with every variable, and do some math to figure out the new bounds for every operator. For example if the code adds two variables, in your analysis you add the upper bounds together to form the new upper bound, and you add the lower bounds together to get the new lower bound.

But of course it's not that simple. Firstly, what if there is non-straight-line code? if's are not too bad, you can just evaluate both sides and then take the union of the ranges after it (which can lose information! if two ranges have a gap in between, their union will span the gap). Loops require tricks, a naive implementation may run billions of iterations of analysis on a loop or never even terminate at all. Even if you use an abstract domain that has no infinite ascending chains, you can still get into trouble. The keywords to solve this are "widening operator" and (optionally, but probably a good idea) "narrowing operator".

It's even worse than that, because what's a variable? Your regular local variable of scalar type that never has its address taken isn't too bad. But what about arrays? Now you don't even know for sure which entry is being affected - the index itself may be a range! And then there's aliasing. That's far from a solved problem and causes many real world tools to make really pessimistic assumptions.

Also, function calls. You're going to call functions from some context, hopefully a known one (if not, then it's simple: you know nothing). That makes it hard, not only is there suddenly a lot more state to keep track of at the same time, there may be several places a function could be called from, including itself. The usual response to that is to re-evaluate that function when a range of one of its arguments has been expanded, once again this could take billions of steps if not done carefully. There also algorithms that analyze a function differently for different context, which can give more accurate results, but it's easy to spend a lot of time analyzing contexts that aren't different enough to matter.

Anyway if you've made it this far, you could read Accurate Static Branch Prediction by Value Range Propagation and related papers to get a good idea of how to actually do this.

And that's not all. Considering only the ranges of individual variables without caring about the relationships between (keyword: non-relational abstract domain) them does bad on really simple (for a human reader) things such as subtracting two variables that always close together in value, for which it will make a large range, with the assumption that they may be as far apart as their bounds allow. Even for something trivial such as

; assume x in [0 .. 10]
int y = x + 2;
int diff = y - x;

For a human reader, it's pretty obvious that diff = 2. In the analysis described so far, the conclusions would be that y in [2 .. 12] and diff in [-8, 12]. Now suppose the code continues with

int foo = diff + 2;
int bar = foo - diff;

Now we get foo in [-6, 14] and bar in [-18, 22] even though bar is obviously 2 again, the range doubled again. Now this was a simple example, and you could make up some ad-hoc hacks to detect it, but it's a more general problem. This effect tends to blow up the ranges of variables quickly and generate lots of unnecessary warnings. A partial solution is assigning ranges to differences between variables, then you get what's called a difference-bound matrix (unsurprisingly this is an example of a relational abstract domain). They can get big and slow for interprocedual analysis, or if you want to throw non-scalar variables at them too, and the algorithms start to get more complicated. And they only get you so far - if you throw a multiplication in the mix (that includes x + x and variants), things still go bad very fast.

So you can throw something else in the mix that can handle multiplication by a constant, see for example Abstract Domains of Affine Relations⋆ - this is very different from ranges, and won't by itself tell you much about the ranges of your variables, but you could use it to get more accurate ranges.

The story doesn't end there, but this answer is getting long. I hope this does not discourage you from researching this topic, it's a topic that lends itself well to starting out simple and adding more and more interesting things to your analysis tool.

Upvotes: 2

icdevppl
icdevppl

Reputation: 175

Checking integer overflows in C:

When you add two 32-bit numbers and get a 33-bit result, the lower 32 bits are written to the destination, with the highest bit signaled out as a carry flag. Many languages including C don't provide a way to access this 'carry', so you can use limits i.e. <limits.h>, to check before you perform an arithmetic operation. Consider unsigned ints a and b :

if MAX - b < a, we know for sure that a + b would cause an overflow. An example is given in this C FAQ.

Watch out: As chux pointed out, this example is problematic with signed integers, because it won't handle MAX - b or MIN + b if b < 0. The example solution in the second link (below) covers all cases.

Multiplying numbers can cause an overflow, too. A solution is to double the length of the first number, then do the multiplication. Something like:

(typecast)a*b  

Watch out: (typecast)(a*b) would be incorrect because it truncates first then typecasts.

A detailed technique for c can be found HERE. Using macros seems to be an easy and elegant solution.

Upvotes: 1

Grzegorz Szpetkowski
Grzegorz Szpetkowski

Reputation: 37914

It may be worth to try with cppcheck static analysis tool, that claims to detect signed integer overflow as of version 1.67:

New checks:
- Detect shift by too many bits, signed integer overflow and dangerous sign conversion

Notice that it supports both C and C++ languages.

There is no overflow check for unsigned integers, as by Standard unsigned types never overflow.

Here is some basic example:

#include <stdio.h>

int main(void)
{
    int a = 2147483647;
    a = a + 1;

    printf("%d\n", a);

    return 0;
}

With such code it gets:

$ ./cppcheck --platform=unix64 simple.c 
Checking simple.c...
[simple.c:6]: (error) Signed integer overflow for expression 'a+1'

However I wouldn't expect too much from it (at least with current version), as slighly different program:

int a = 2147483647;
a++;

passes without noticing overflow.

Upvotes: 2

Related Questions