Can BDD work for Big Data ETL testing?

Question

I was wondering if anyone uses BDD for testing a Big Data ETL application. I can see how BDD can be used for testing applications that a client interacts with, but in the case of a Big Data ETL application there is no client interaction, so it's hard to see what 'When' I might use. For example:

    Given 100 events of type A occur
    And 50 events of type B occur after 5 minutes
    Then database rows should be:
      |Type|Count|Bucket|
      |A|100|1|
      |B|50|2|

But that seems wrong. Does anyone have any insight?

Lunivore · Accepted Answer

Can you give me an example of what you'd expect to see in an ETL output?

There are a couple of responses you could give to this. One might be the different kinds of database rows you'd expect, and the fact that some of them will probably be repeated, but not others. That was something that struck me as weird, but if you're used to working with star schemas then you'll probably notice other differences instead.

Normally I'd steer people away from talking about the database, but if you're working with star schemas, I think it's OK to mention the facts and dimensions (I haven't worked with ETL a lot, but I do remember talking through specific examples of these and what I would expect to see).

The alternative is to use the client.

I saw that you said there was no client; however, there's always a client, even if it's one that might exist in the future. There are implications for ETL which run across security, performance and access, amongst others. It's worth having a client, even if it's a string-based or SQL-based toy, to explore the things which might trip you up.

Why are you doing this? What's new about the thing the business or users or customers will be able to do when this is in place, that they can't do already? And can you get hold of an example of that?

"We'll be able to understand how X is performing against Y standard."

Great. Can you give me an example of some X, some Y, and some standard? How will you measure the performance? What data will you be looking for? Should everyone be able to see that data? Can you think of any scenario where someone shouldn't be able to access that?

Those examples become the ETL equivalent of scenarios; the conversations retain the same pattern. You just end up automating them at a different level, since your API is machine-oriented rather than human-oriented, and some of your conversations will be about monitoring instead of testing. Your conversations should still be with the people.

Your "when" will be the query or report that you run, within the data, permission and security context in which you run it.
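To make that concrete, here is a minimal sketch in Python of the scenario from the question, with the report query as the "when". Everything in it is an assumption for illustration: `run_etl` is a toy stand-in for the real ETL job, `run_report` is a hypothetical SQL-based toy client, the `event_counts` table and its columns are invented, and the five-minute bucket size is guessed from the question. An in-memory SQLite database stands in for the real store.

```python
import sqlite3

BUCKET_SECONDS = 300  # assumption: five-minute buckets, numbered from 1


def run_etl(events, conn):
    """Toy stand-in for the real ETL job: count events per type and bucket."""
    counts = {}
    for event_type, timestamp in events:
        bucket = timestamp // BUCKET_SECONDS + 1
        key = (event_type, bucket)
        counts[key] = counts.get(key, 0) + 1
    conn.execute(
        "CREATE TABLE event_counts "
        "(event_type TEXT, event_count INTEGER, bucket INTEGER)"
    )
    conn.executemany(
        "INSERT INTO event_counts VALUES (?, ?, ?)",
        [(t, c, b) for (t, b), c in counts.items()],
    )


def run_report(conn):
    """Hypothetical toy client: the query a user of the data would run."""
    return conn.execute(
        "SELECT event_type, event_count, bucket "
        "FROM event_counts ORDER BY bucket, event_type"
    ).fetchall()


def test_events_are_counted_per_bucket():
    # Given 100 events of type A, and 50 events of type B five minutes later,
    # have been loaded by the ETL
    events = [("A", 0)] * 100 + [("B", 300)] * 50
    conn = sqlite3.connect(":memory:")
    run_etl(events, conn)

    # When the report is run
    rows = run_report(conn)

    # Then the report shows one row per type and bucket
    assert rows == [("A", 100, 1), ("B", 50, 2)]
```

The loading moves into the "given", and the assertion is made against what the client sees rather than against the raw tables. In a real suite the connection would also carry the permission and security context of whoever is running the report.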
