Ingo
Ingo

Reputation: 1572

How to add storage-level caching between DynamoDB and Titan?

I am using the Titan/DynamoDB library to use AWS DynamoDB as a backend for my Titan DB graphs. My app is very read-heavy and I noticed Titan is mostly executing query requests against DynamoDB. I am using transaction- and instance-local caches and indexes to reduce my DynamoDB read units and the overall latency. I would like to introduce a cache layer that is consistent for all my EC2 instances: A read/write-through cache between DynamoDB and my application to store query results, vertices, and edges.

Caching between Titan API and DynamoDB

I see two solutions to this:

  1. Implicit caching done directly by the Titan/DynamoDB library. Classes like the ParallelScanner could be changed to read from AWS ElastiCache first. The change would have to be applied to read & write operations to ensure consistency.
  2. Explicit caching done by the application before even invoking the Titan/Gremlin API.

The first option seems to be the more fine-grained, cross-cutting, and generic.

Upvotes: 2

Views: 791

Answers (1)

Alexander Patrikalakis
Alexander Patrikalakis

Reputation: 5195

First, ParallelScanner is not the only thing you would need to change. Most importantly, all the changes you need to make are in DynamoDBDelegate (that is the only class that makes low level DynamoDB API calls).

Regarding implicit caching, you could add a caching layer on top of DynamoDB. For example, you could implement a cache using API Gateway on top of DynamoDB, or you could use Elasticache. Either way, you need to figure out a way to invalidate Query/Scan pages. Inserting/deleting items will cause page boundaries to change so it requires some thought.

Explicit caching may be easier to do than implicit caching. The level of abstraction is higher, so based on your incoming writes it may be easier for you to decide at the application level whether a traversal that is cached needs to be invalidated. If you treat your graph application as another service, you could cache the results at the service level.

Something in between may also be possible (but requires some work). You could continue to use your vertex/database caches as provided by Titan, and use a low value for TTL that is consistent with how frequently you write columns. Or, you could take your caching approach a step further and do the following.

  1. Enable DynamoDB Stream on edgestore.
  2. Use a Lambda function to stream the edgestore updates to a Kinesis Stream.
  3. Consume the Kinesis Stream with edgestore updates in the same JVM as the Gremlin Server on each of your Gremlin Server instances. You would need to instrument the database level cache in Titan to consume the Kinesis stream and invalidate the cached columns as appropriate, in each Titan instance.

Upvotes: 1

Related Questions