Steven M

Reputation: 574

Querying billions of records in real time

I am working on a data-driven analytical software project that produces reports and recommendations on financial data (transactions). The data consists of 1.7 billion records, with 200k new records added every day. Each record describes a transaction and carries fairly little data (from_account, to_account, amount, time_stamp, etc.).

Once written, the data does not need to be changed (so essentially it's a WORM paradigm), but the queries can become quite complex. Some of the queries are AML (Anti-Money-Laundering) logic, which looks for relations between accounts such as "U-Turn" transactions:

A->B->C->D->A

I need to run multiple queries to detect such patterns, and the query times are pretty fast as long as each account has a 'normal' number of transactions. The problem appears when account C (in the above example) suddenly has millions of transactions and the query takes 60 or more seconds instead of 0.5 seconds.
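
To make the blow-up concrete, here is a rough sketch of what the naive 4-hop expansion looks like (toy in-memory data with made-up account IDs, not my actual schema); the cost is roughly the product of the out-degrees along the path, which is why a single hub account with millions of transactions dominates the runtime:

    # Naive A->B->C->D->A search over an adjacency list (toy data, hypothetical IDs).
    from collections import defaultdict

    edges = defaultdict(list)  # from_account -> list of to_accounts, one entry per transaction
    transactions = [
        ("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"),  # a U-Turn: A->B->C->D->A
        ("B", "X"), ("X", "A"),
    ]
    for src, dst in transactions:
        edges[src].append(dst)

    def u_turns(start):
        """Return all start->B->C->D->start cycles; cost ~ product of out-degrees."""
        hits = []
        for b in edges[start]:
            for c in edges[b]:
                for d in edges[c]:
                    if start in edges[d]:  # closes the cycle back to the start account
                        hits.append((start, b, c, d, start))
        return hits

    print(u_turns("A"))  # [('A', 'B', 'C', 'D', 'A')]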

I am inclined to use neo4j to search for relations between accounts, but I am not sure whether the searches will be fast enough. Other options could be in-memory DBs such as MemSQL, Redis, or Aerospike; I am also looking at HBase/Hadoop, CouchDB, and MongoDB.

Which stack would provide the fastest query results currently possible?

Upvotes: 2

Views: 2269

Answers (2)

Ronen Botzer

Reputation: 7117

Each class of database has its strengths, and for the AML use case you described, a graph database such as neo4j would be the correct choice.

A document store such as Couchbase or Mongo would make little sense, and a key-value store such as Aerospike or Redis would only make sense if there were a constant path length you were interested in that could be pre-computed. As you are trying to find all paths that start at a given node and end in it, regardless of the number of edges, that's not likely.
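
For reference, a minimal sketch of what that variable-length cycle search could look like in Cypher, run through the official neo4j Python driver (the Account label, TRANSFER relationship type, hop bounds, and connection details are assumptions for illustration, not something from the question):

    # Hedged sketch: find transaction cycles that start and end at one account.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    CYCLES = """
    MATCH path = (a:Account {id: $account_id})-[:TRANSFER*2..5]->(a)
    RETURN [n IN nodes(path) | n.id] AS accounts
    LIMIT 100
    """

    with driver.session() as session:
        for record in session.run(CYCLES, account_id="A"):
            print(record["accounts"])

    driver.close()

In practice you would bound the hop range (here *2..5) to keep the search tractable, even though the logical requirement is "any number of edges".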

Upvotes: 1

Revln9

Reputation: 847

I'd suggest you choose a memory-based database on an appropriately sized machine (8 or 16 GB of RAM). For the analytics writes, try job queues (e.g. RabbitMQ), at least for loading the 1.7 billion historical records. Redis or memcache can handle your daily writes (200k) without problems or even tweaking, especially if you don't really need transactions (read about the batch method for Redis).
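
For the batch method, something like redis-py's pipeline works; here is a rough sketch (the key layout and field names are just an example, not a recommendation for your schema):

    # Hedged sketch: batch a day's transaction writes through one Redis pipeline.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    daily_transactions = [
        {"id": "tx1", "src": "A", "dst": "B", "amount": "120.00", "ts": "1467072000"},
        {"id": "tx2", "src": "B", "dst": "C", "amount": "80.00", "ts": "1467072060"},
    ]

    pipe = r.pipeline(transaction=False)          # plain batching, no MULTI/EXEC
    for tx in daily_transactions:
        pipe.hset("tx:" + tx["id"], mapping=tx)   # one hash per transaction record
        pipe.rpush("out:" + tx["src"], tx["id"])  # index transactions by sender
    pipe.execute()                                # one round trip for the whole batch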

Here is an interesting post about how Instagram used Redis to map over 300 million pictures to their users:

http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value

http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram

But be aware that this kind of in-memory database is a key-value store, which means no complex data structures and no complex queries.

An alternative would be to try Elasticsearch; it has some benefits that could make the task easier. Verizon uses it to store over 500 billion records. That doesn't mean everybody could do that, but at least it shows that it can be done.

See this link for Elasticsearch:

https://sematext.com/blog/2013/07/08/elasticsearch-refresh-interval-vs-indexing-performance/
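
The post above is about how the refresh interval affects indexing throughput; a rough sketch of tuning it during a bulk load, using the plain Elasticsearch REST settings endpoint (the index name "transactions" is just a placeholder):

    # Hedged sketch: pause near-real-time refresh during a bulk import, then restore it.
    import requests

    ES = "http://localhost:9200"
    INDEX = "transactions"   # placeholder index name

    def set_refresh(interval):
        """PUT the index-level refresh_interval setting ("30s", or "-1" to disable)."""
        resp = requests.put(
            ES + "/" + INDEX + "/_settings",
            json={"index": {"refresh_interval": interval}},
        )
        resp.raise_for_status()

    set_refresh("-1")   # no refreshes while the 1.7 billion historical records load
    # ... bulk-index the historical data here ...
    set_refresh("1s")   # back to near-real-time search afterwards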

I've heard that HBase/Hadoop and CouchDB perform well on large data sets, but I can't give you much more information as I don't really use them.

Hope this helps!

Upvotes: 0
