j03m

Reputation: 5301

Approaches to unit testing big data

Imagine you are designing a system and you want to start writing tests that verify not only functionality but also performance and scalability. Are there any techniques you can share for handling large sets of data in different environments?

Upvotes: 4

Views: 2479

Answers (4)

Tri

Reputation: 1

There are two situations that I encountered:

  1. large HDFS datasets that serve as Data Warehouse or data sink for other applications

  2. Apps with HBASE or other distributed databases

Tips for unit testing in both cases:

a. Test the different functional components of the app first; there is no special rule for Big Data apps. Just like for any other app, unit testing should ascertain whether the different components of the app are working as expected. You can then integrate functions/services/components, etc. to do the SIT, if applicable.

b. Specifically, if HBase or any other distributed database is involved, test what is required of that DB. For example, distributed databases often do not support ACID properties the way traditional databases do and are instead constrained by the CAP theorem (Consistency, Availability, Partition tolerance); usually only two of the three can be guaranteed. Most RDBMSs are CA, HBase is usually CP, and Cassandra AP. As a designer or test planner you should know, based on your application's features, which CAP trade-off your distributed database makes, and create your test plan to check the real behaviour accordingly (a consistency-check sketch follows the list below).

  1. Regarding performance - again, a lot depends on the infrastructure and the app design, and some software implementations are more taxing than others. You might check the amount of partitioning, for example; it is all case-by-case.

  2. Regarding scalability - the very advantage of a Big Data implementation is that it scales easily compared to a traditional architecture. I have never thought of this as something to test. For most Big Data apps you can scale up easily, and horizontal scaling in particular is very easy, so I am not sure anyone tests scalability for most apps.
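As a rough illustration of the consistency point in (b) above, here is a minimal sketch of a test that checks how quickly a written value becomes visible again. The `KeyValueStore` wrapper and the `TestStores.forCluster()` helper are hypothetical stand-ins for your own client code against the real cluster:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class EventualConsistencyTest {

    // Hypothetical minimal wrapper around the distributed store under test
    // (e.g. an HBase or Cassandra test cluster).
    interface KeyValueStore {
        void put(String key, String value);
        String get(String key);
    }

    // TestStores.forCluster() is a hypothetical helper that wires the wrapper
    // to whatever cluster your test environment provides.
    private final KeyValueStore store = TestStores.forCluster();

    @Test
    public void writtenValueBecomesVisibleWithinBoundedTime() throws Exception {
        store.put("user:42", "alice");

        long deadline = System.currentTimeMillis() + 5000;   // tolerated convergence window
        String seen = null;
        while (System.currentTimeMillis() < deadline) {
            seen = store.get("user:42");
            if ("alice".equals(seen)) {
                break;                                        // replica has caught up
            }
            Thread.sleep(100);                                // poll and retry
        }
        assertEquals("alice", seen);
    }
}
```

The tolerated convergence window (5 seconds here) is itself a design decision you would take from your application's requirements.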

Upvotes: 0

Joshua Fox

Reputation: 19675

Separate the different types of test.

  1. Functional testing should come first, starting unit tests with small amounts of mock data.
  2. Next, integration tests, with small amounts of data in a data store, though obviously not the same instance as the store with the large data sets.
  3. You may be able to reduce your development effort by doing performance and scalability tests together.

One important tip: your test data set should be as realistic as possible. Use production data, anonymizing as necessary. Because big data performance depends on the statistical distributions in the data, you don't want to use synthetic data. For example, if you use fake user data that is basically the same user info repeated a million times, you will get very different scalability results than you would with real-life, messy user data that has a wide distribution of values.
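If you do anonymize production data, one approach is to replace identifiers with deterministic pseudonyms so that the value distribution is preserved. This is just a sketch of that idea; the salt handling and field choice are illustrative, not a complete anonymization scheme:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Anonymizer {

    // Replace an identifier with a stable pseudonym: the same input always maps to the
    // same output, so repeated values stay repeated (the column's distribution is
    // preserved) while the real value is hidden.
    static String pseudonym(String value, String salt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest((salt + value).getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                sb.append(String.format("%02x", hash[i] & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two occurrences of the same email map to the same pseudonym.
        System.out.println(pseudonym("alice@example.com", "test-salt"));
        System.out.println(pseudonym("alice@example.com", "test-salt"));
    }
}
```

A fixed salt keeps the mapping stable across runs; rotate it if the pseudonyms themselves must not be linkable over time.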

For more specific advice, I'd need to know the technology you're using. In Hadoop, look at MRUnit. For RDBs, DBUnit. Apache Bigtop can provide inspiration, though it is aimed at core projects on Hadoop rather than specific application-level projects.
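For Hadoop MapReduce jobs, an MRUnit mapper test looks roughly like this (`WordCountMapper` is a hypothetical mapper standing in for whatever mapper your job defines):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    // MapDriver feeds the mapper in isolation, with no cluster and no HDFS involved.
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOneCountPerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();
    }
}
```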

Upvotes: 0

Brian Geihsler

Reputation: 2087

I would strongly recommend prioritizing functionality tests (using TDD as your development workflow) before working on performance and scalability tests. TDD will ensure your code is well designed and loosely coupled, which will make it much, much easier down the road to create automated performance and scalability tests. When your code is loosely coupled, you get control over your dependencies. When you have control over your dependencies, you can create any configuration you want for any high-level test you want to write (see the sketch below).
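Here is a minimal sketch of what that control over dependencies buys you; the `EventRepository` interface and `EngagementScorer` class are hypothetical examples, not anything from a specific library:

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class EngagementScorerTest {

    // The dependency the job talks to: production wires this to HBase/HDFS, unit tests
    // wire it to an in-memory fake, performance tests wire it to a large static fixture.
    interface EventRepository {
        List<String> eventsForUser(String userId);
    }

    // The class under test receives its data source from outside instead of
    // constructing it internally, so every kind of test controls which store it uses.
    static class EngagementScorer {
        private final EventRepository events;

        EngagementScorer(EventRepository events) {
            this.events = events;
        }

        int score(String userId) {
            return events.eventsForUser(userId).size();
        }
    }

    @Test
    public void scoresFromInjectedFake() {
        EngagementScorer scorer =
                new EngagementScorer(user -> Arrays.asList("click", "view", "purchase"));
        assertEquals(3, scorer.score("u1"));
    }
}
```

The same `EngagementScorer` can later be wired to a cluster-backed repository for integration tests or to a large fixture for performance tests, without changing the class itself.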

Upvotes: 2

Bek Raupov

Reputation: 3777

For testing and measuring performance, you could use static data sources and input (it could be a huge dump file or an SQLite DB).

You can create a test and include it in your integration build so that if a particular function call takes more than X seconds, it throws an error (see the sketch below).

As you build up more of your system, you will see that number increase and eventually break your test.
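A minimal sketch of such a threshold test in JUnit; `Ingestor.run(...)` and the fixture path are hypothetical placeholders for your own entry point and static test data:

```java
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class IngestPerformanceTest {

    @Test
    public void ingestsStaticFixtureWithinBudget() {
        long start = System.nanoTime();

        // Hypothetical entry point of the component being measured; the input is a
        // static fixture checked into the test resources (a dump file, say).
        Ingestor.run("src/test/resources/events-dump.csv");

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        assertTrue("ingest took " + elapsedMs + " ms, budget is 5000 ms", elapsedMs < 5000);
    }
}
```

When the build starts failing on this test, that is the signal to go back and look at performance before the system grows further.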

You could spend 20% of your time to get 80% of the functionality; the remaining 80% goes to performance and scalability :)

Scalability - think about a service-oriented architecture, so that you can put a load balancer in between and increase your state/processing capacity simply by adding new hardware/services to your system.

Upvotes: -1
