Reputation: 3906

Unit tests for a parser

How to write unit tests for the following scenario?

The program to test is a parser that recognizes different structures in its input. You can think of it as parsing a markup language.

A core problem is recognizing structures in the text whenever certain patterns match. In practice those will not be regular expressions, but for this question it should be fine to think of them as such.

The problem

Assume I want to identify room numbers and have a pattern p to match them in parts of the input that do not match any other pattern p2 from a set of patterns (for example, header and footer sections of my imaginary input document).

I could imagine writing unit tests that expect several room numbers to be found for a given input. However, coming up with good tests, border cases in particular, is really problematic here.

Testing:

Interesting test cases should somehow account for different combinations of patterns. In particular, deciding whether some text matches the pattern for room numbers is trivial and more of a (granted, still important) nice-to-have unit test. I can distinguish several kinds of tests:

1.:
"007" - expect: false
"01-001" - expect: true
"R02-33b" - expect: true
"01-001andsometext" - expect: false
"01-001 andsometext" - expect: true
"02-33X" - expect: false
"" - expect: false

2.:
"We meet in R01-001. Please invite agent 007." -- expect 1 matching rooms
"Excercise groups take place in 02-23b and 02-33c." --- expect 2 matches
... 

3.:
Integration test style. 
Long input with room numbers in the texts and in header/footer 
where I only want to recognize n rooms:
"... 150+ character string ..." - expect exactly 7 matches, 
                                  check if the right ones are matched

While the first one is a perfectly fine unit test that really tests only a very fixed part of my program, it is also really easy to forget about the difficult cases. Looking at the second example, I might say to myself: "Man, I really should have included a test case where the room number is followed by a full stop, question mark, etc.".

However, the second example isn't really much better. In particular, I can still miss many "border cases" because there are just so many (other punctuation, Unicode, etc.) for a parser.

But even beyond that, what I really want is not only to detect room numbers of a certain format, but also to dismiss those that are in "bad" sections. Testing the parsing on a "real / typical" input seems to be terrible unit test practice: barely readable, the expected result is prone to changes in other parts of my program (the set of "bad" patterns), and so on.
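
To make the "so many border cases" concern concrete, here is a minimal sketch of generating them instead of writing each by hand. Everything in it is an assumption for illustration: the regular expression `ROOM` is only a stand-in for the real pattern p, and the `TERMINATORS` and `EXTENDERS` lists are example characters, not a complete set.

```python
import re

# Hypothetical stand-in for pattern p: optional "R", two digits, a hyphen,
# two or three digits, an optional lowercase letter, not embedded in a word.
ROOM = re.compile(r"(?<!\w)R?\d{2}-\d{2,3}[a-z]?(?!\w)")


def find_rooms(text):
    """Return every room number found in text."""
    return ROOM.findall(text)


# Characters that may directly follow a room number; the match must survive.
TERMINATORS = [" ", ".", "?", "!", ",", ";", ":", ")", "\n",
               "\u2026", "\u201d", "\u3002"]
# Suffixes that turn the candidate into a longer token; no match is expected.
EXTENDERS = ["X", "x1", "9", "_", "\u00e9"]


def boundary_cases(room="02-33b"):
    """Yield (input, expected matches) pairs built around one valid room."""
    for suffix in TERMINATORS:
        yield f"See {room}{suffix}", [room]
    for suffix in EXTENDERS:
        yield f"See {room}{suffix}", []


def failing_cases():
    """Return (input, actual matches) for every case that disagrees."""
    return [(text, find_rooms(text))
            for text, expected in boundary_cases()
            if find_rooms(text) != expected]
```

A single test can then assert that `failing_cases()` is empty, and a newly discovered border case becomes one more entry in a list rather than a new test function.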

My conclusion so far

Somehow I think I'll want to write the typical unit tests, much like in my example 1. However, I think I'll need many different kinds of input and will still miss a lot more border cases (e.g. Unicode punctuation and so on) than I usually do for my other functions that work on numbers, trees, graphs, etc. So I really hope for some advice from someone who has more experience dealing with string input.

Upvotes: 2

Answers (3)

Aaron Digulla

Reputation: 328870

There are several sources of information that you can use to come up with test cases:

The requirements documents. Obviously, your code should be able to handle all the cases mentioned there.
The code. Read the code and ask yourself: What kind of input would I need to execute this line of code? Code coverage tools help a lot here. When the test cases execute each line of the code and every possible combination of conditions in all if statements, you should have covered your ground pretty thoroughly.
Think of input that your code should not accept. When parsing numbers, how about these inputs: 0, 0., .0, 0.0, .
Bug reports. Eventually, bug reports will come in. Since bug reports essentially mean "this is the kind of thing you usually get wrong", each bug report should become a unit test.
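
As a small illustration of the third point, the number inputs listed above can be pinned down in a table that records which ones the parser accepts. This is only a sketch: Python's built-in `float` stands in for the parser under test, and the helper names `accepts` and `acceptance_table` are made up for the example.

```python
# The inputs from the list above.
CANDIDATES = ["0", "0.", ".0", "0.0", "."]


def accepts(parse, text):
    """Return True if parse(text) succeeds, False if it raises ValueError."""
    try:
        parse(text)
    except ValueError:
        return False
    return True


def acceptance_table(parse, inputs):
    """Map each input to whether the parser accepts it."""
    return {text: accepts(parse, text) for text in inputs}


# float() is only a stand-in for the real parser here.
print(acceptance_table(float, CANDIDATES))
```

With `float`, only the lone "." is rejected. Writing the table down forces a decision about each input, and a test can compare it against what the requirements say.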

Upvotes: 1

k.m

Reputation: 31484

Did you consider data-driven testing? In situations like yours, it's impossible to test all possible inputs (well, when is that possible, really?), and having some well-defined set of input data might help. Most modern testing frameworks (e.g. JUnit or NUnit) do allow such testing (e.g. specifying input cases in a file or database).

My bet here would be on having separate tests for the real edge cases (and since a parser is usually such a general tool, those might also benefit from DDT, unless you can come up with very specific edge cases) and the remaining tests, in "prove it works" style, simply fed from some larger input file or data source.
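
A minimal sketch of the data-driven idea, in Python rather than JUnit or NUnit. The cases are the ones from the question, stored as CSV; they are embedded in a string here only to keep the example self-contained, where a real suite would read a separate file. The `ROOM` regular expression is a hypothetical stand-in for the real pattern, and `load_cases` and `run_cases` are made-up helper names.

```python
import csv
import io
import re

# Hypothetical stand-in for the real room-number pattern.
ROOM = re.compile(r"(?<!\w)R?\d{2}-\d{2,3}[a-z]?(?!\w)")

# In practice this would be the content of a separate data file.
CASES_CSV = '''input,expected_count
007,0
01-001,1
R02-33b,1
01-001andsometext,0
01-001 andsometext,1
02-33X,0
"",0
"We meet in R01-001. Please invite agent 007.",1
"Excercise groups take place in 02-23b and 02-33c.",2
'''


def load_cases(source):
    """Read (input, expected match count) pairs from a CSV source."""
    reader = csv.DictReader(source)
    return [(row["input"], int(row["expected_count"])) for row in reader]


def run_cases(cases):
    """Return (input, expected, actual) for every case that fails."""
    failures = []
    for text, expected in cases:
        actual = len(ROOM.findall(text))
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


cases = load_cases(io.StringIO(CASES_CSV))
print(run_cases(cases))
```

The test code stays the same when cases are added; only the data grows. With pytest, the list returned by `load_cases` could also be passed to `pytest.mark.parametrize` so that each row is reported as its own test.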

Upvotes: 0

khmarbaise

Reputation: 97567

In my opinion and experience, the best approach is to check the code coverage achieved by the unit tests. From the result you can see which areas of your code are not tested, which gives you hints about where you still have to write tests. Still, in such circumstances you always have the problem that you might miss some edge cases.