MySQL Data Model to Cassandra Help?

Question

I'm trying to move a RDBMS model over to Cassandra, and having a hard time creating the schema. Here is my data model:

CREATE TABLE Domain (
    ID INT NOT NULL PRIMARY KEY,
    DomainName NVARCHAR(74) NOT NULL,
    HasBadWords BIT,
    ...
);
INSERT INTO Domain (DomainName, HasBadWords) VALUES ('domain1.com', 0);
INSERT INTO Domain (DomainName, HasBadWords) VALUES ('domain2.com', 0);

CREATE TABLE ZoneFile (
    ID INT NOT NULL PRIMARY KEY,
    DomainID INT NOT NULL,
    Available BIT NOT NULL,
    Nameservers NVARCHAR(MAX),
    Timestamp DATETIME NOT NULL
);
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (1, 0, "ns1", '2010-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (2, 0, "ns1", '2010-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (1, 1, "ns2", '2011-01-01');
INSERT INTO ZoneFile (DomainID, Available, Nameservers, Timestamp) VALUES (2, 1, "ns2", '2011-01-01');

CREATE TABLE Backlinks (
    ID INT NOT NULL PRIMARY KEY,
    DomainID INT NOT NULL,
    Backlinks INT NOT NULL,
    Indexed INT NOT NULL,
    Timestamp DATETIME NOT NULL
);
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (1, 100, 200, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (2, 300, 600, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (1, 500, 1000, '2010-01-01');
INSERT INTO Backlinks (DomainID, Backlinks, Indexed, Timestamp) VALUES (2, 600, 1200, '2010-01-01');

From this, I've deduced that I can probably have one Keyspace: DomainData. In this keyspace, I can have a columnfamily called "Domain" which is like my Domain table in sql:

"Domain" : { //ColumnFamily
    "domain1.com" : { "HasBadWords" : 0 }, //SuperColumn
    "domain2.com" : { "HasBadWords" : 0 }  //SuperColumn
}

The next tables are where I start getting confused. ZoneFile and Backlinks are essentially supposed to store a history of results from looking up these values for each domain. So, one Domain to Many ZoneFile records. For querying purposes, I want to be able to easily get the 'newest' ZoneFile record, or a given Domain. I will need to do the same for Backlinks.

I was considering something like this, and doing a range lookup on the key for the domain, and then getting the 'last' record which should be the newest timestamp...

"ZoneFiles" : { //ColumnFamily
    "domain1.com:2010-01-01 12:00:00.000" : { "Available" : 0, "Nameservers" : "ns1" }, //SuperColumn
    "domain1.com:2011-01-01 12:00:00.000" : { "Available" : 1, "Nameservers" : "ns2" }, //SuperColumn
    "domain2.com:2010-01-01 12:00:00.000" : { "Available" : 0, "Nameservers" : "ns1" }, //SuperColumn
    "domain2.com:2011-01-01 12:00:00.000" : { "Available" : 1, "Nameservers" : "ns2" }  //SuperColumn
}

I'm not convinced this is the right answer, the combination of a string domain and string datetime in a key feels wrong. Could someone point me in the right direction?

EDIT:

Assuming I use:

"ZoneFiles" : {
  "domain1.com" : {
    timestamp1 : "{\"available\":1,\"nameservers\":\"ns1\"}",
    timestamp2 : "{\"available\":1,\"nameservers\":\"ns1\"}",
  }
}

How would I query a list of domain rows where the newest timestamp is older than a given date?

tom.wilkie · Accepted Answer

If I understand your question correctly, the only query you want to do on this model is "please get me the latest zonefile or backlinks for a given domain" ?

If thats the case, I would store the latest values for these in the "Domain" column family, under the domains row key, in separate columns. I would also store when this latest value was updated (the timestamp). Every time you get new values for the info in zonefile and backlinks, I would just overwrite the value in the "Domain" column family and update the timestamp.

I assume you are also keeping this historical data so you can query it, and I assume the kind of query will be "show me all the updates for a given domain between two times" (is this correct?). If so, I wouldn't manually construct a composite row key like that, since it will require you to use the Order Preserving Partitioner to get the correct results from get_range_slices. And as you probably know, load balancing with the OPP can be a difficult task.

Instead, I would have the row key be domain id, and the column key be the timestamp of the update. Then you can either pack you updates into a single value (eg using json), use super columns or use the new composite keys in 0.8. If done like this, you can use a get_slice to satisfy your query, and it will behave correctly with the Random Partitioner, making load balancing much easier.

Tom Wilkie | Acunu | www.acunu.com | @tom_wilkie

Reply to comment: "how would I query a list of domains that's most recent zonefile timestamp column is older than a given timestamp?"

You could do that by inserting into another column family:

row key: day (or hour, or some other reasonable 'bucketing') 
column key: timestamp of update 
value: domain

...every time you update the zonefile. Then, to get the most recently updated domains since t, do:

result = []
for i in day(t) ... day(now):
    result.extend(get_slice(i, range(t, '')))

This would require you to remove repeat entries from result, so would only work best when t is pretty recent. You also have to consider the load balancing for the writes, which would focus all the load on a single server (since, at any one time, you are inserting into only one row)

If these trade offs aren't appropriate, then you could look at the hadoop integrations and use that to perform this query. Or you could make other tradeoff (use the OPP, or do a read before a write to remove the duplicates, which would be v. slow)

MySQL Data Model to Cassandra Help?

Answers (1)

Related Questions