Samson

Reputation: 2821

File sharding between servers algorithm

I want to distribute files across multiple servers and have them available with very little overhead. So I was thinking of the following naive algorithm:

Provided that each file has a unique numeric ID (e.g. 120151), I'm thinking of segmenting the files using the modulo (%) operator. This works if I know the number of servers in advance:

Example with 2 servers (generalizes to n servers):

  server 1 : ID % 2 = 0 (contains even IDs) 
  server 2 : ID % 2 = 1 (contains odd IDs) 

However, when I need to scale out and add more servers, I will have to re-shuffle the files to obey the new rules, and we don't want that.

Example:

Say I add server 3 into the mix because I cannot handle the load. Server 3 will contain files that respect the following criteria:
server 3 : ID%3 = 2

Step 1 is to move the files where ID % 3 = 2 from server 1 and server 2 over to server 3. However, I'll also have to move some files between server 1 and server 2 so that the following holds:

server 1 : ID%3 = 0
server 2 : ID%3 = 1
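To see how expensive that reshuffle is, here is a quick sketch (Python, hypothetical IDs) counting how many files change servers when going from ID % 2 to ID % 3:

```python
# Hypothetical sketch: count how many files the naive modulo scheme
# relocates when growing from 2 servers (ID % 2) to 3 servers (ID % 3).
file_ids = range(10_000)

# A file moves whenever its old server index differs from its new one.
moved = sum(1 for fid in file_ids if fid % 2 != fid % 3)
print(moved / len(file_ids))  # → 0.6666
```

Roughly two thirds of all files relocate, which is exactly the problem.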

What's the optimal way to achieve this?

Upvotes: 3

Views: 1210

Answers (2)

miraculixx

Reputation: 10379

My approach would be to use consistent hashing. From Wikipedia:

Consistent hashing is a special kind of hashing such that when a hash table is resized and consistent hashing is used, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots.

The general idea is this:

  1. Each server is assigned a uniformly distributed (random) id, e.g. server_id = SHA(node_name).
  2. Think of the servers as arranged on a ring, ordered by their server_id.
  3. Each file is likewise assigned a uniformly distributed id, e.g. file_id = SHA(ID), where ID is as given in your example.
  4. Assign each file to the 'closest' server, i.e. the first server (scanning the ring in increasing order) whose server_id >= file_id.
  5. If there is no such server, wrap around the ring and use the server with the smallest server_id.

Note: you can use any hash function that generates uniformly distributed hashes, so long as you use the same hash function for both servers and files.

This way, you get to keep O(1) access, and adding or removing a server is straightforward and does not require reshuffling all files:

a) when adding a server, the new node takes over from its successor on the ring exactly those files whose ids now map to the new server

b) when removing a server, all of its files are handed to the next node on the ring
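The ring described above can be sketched like this (Python; SHA-1 stands in for the uniform hash, node names "a"–"d" are placeholders, and real deployments usually add several virtual nodes per server for better balance):

```python
import bisect
import hashlib

def _hash(key):
    # Any uniformly distributed hash works; truncated SHA-1 is one option.
    return int(hashlib.sha1(str(key).encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes=()):
        self._keys = []    # sorted server hashes (positions on the ring)
        self._nodes = {}   # server hash -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        h = _hash(node)
        bisect.insort(self._keys, h)
        self._nodes[h] = node

    def remove(self, node):
        h = _hash(node)
        self._keys.remove(h)
        del self._nodes[h]

    def locate(self, file_id):
        # First server clockwise with server_id >= file_id; wrap to index 0.
        h = _hash(file_id)
        i = bisect.bisect_left(self._keys, h) % len(self._keys)
        return self._nodes[self._keys[i]]
```

Adding a node to `Ring(["a", "b", "c"])` then relocates only the files that now map to the new node; every other file stays where it was.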

Tom White's graphically illustrated overview explains this in more detail.

Upvotes: 3

Lior Kogan

Reputation: 20658

To summarize your requirements:

  1. Each server should store an (almost) equal amount of files.
  2. You should be able to determine which server holds a given file - based only on the file's ID, in O(1).
  3. When adding a file, requirements 1 and 2 should hold.
  4. When adding a server, you want to move some files to it from all existing servers, such that requirements 1 and 2 would hold.

Your strategy when adding a 3rd server (x is the file's ID):

x%6  Old   New
0    0     0
1    1     1
2    0 --> 2
3    1 --> 0
4    0 --> 1
5    1 --> 2

Alternative strategy:

x%6  Old   New
0    0     0
1    1     1
2    0     0
3    1     1
4    0 --> 2
5    1 --> 2

To locate a server after the change:

0: x%6 in [0,2]
1: x%6 in [1,3]
2: x%6 in [4,5]

Adding a 4th server:

 x%12 Old   New
 0    0     0
 1    1     1
 2    0     0
 3    1     1
 4    2     2
 5    2     2
 6    0     0
 7    1     1
 8    0 --> 3
 9    1 --> 3
10    2     2
11    2 --> 3

To locate a server after the change:

0: x%12 in [0,2, 6]
1: x%12 in [1,3, 7]
2: x%12 in [4,5,10]
3: x%12 in [8,9,11]

When you add a server, you can always build a new mapping this way (in fact, several alternative mappings exist). Note that the divisor for n servers equals lcm(1, 2, ..., n), so it grows very fast.
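This table-growing scheme can be sketched as follows (Python; `add_server` is a hypothetical name, servers are numbered 0..k-1 and the new server always gets id k-1). It reproduces the 3-server and 4-server tables above:

```python
from math import lcm
from collections import Counter

def add_server(table):
    """Given table[r] = server id for residue r (divisor = len(table)),
    extend the mapping to one more server, moving as few slots as possible."""
    k = len(set(table)) + 1             # new number of servers
    new_div = lcm(len(table), k)        # smallest divisor divisible by k
    # Tile the old table up to the new divisor; old assignments are unchanged.
    new = [table[r % len(table)] for r in range(new_div)]
    # Move just enough residue slots from each old server to the new one
    # so that every server ends up owning exactly new_div // k slots.
    quota = new_div // k
    seen = Counter()
    for r in range(new_div):
        seen[new[r]] += 1
        if seen[new[r]] > quota:
            new[r] = k - 1              # reassign excess slot to new server
    return new
```

Starting from one server (`[0]`) and growing twice yields `[0, 1, 0, 1, 2, 2]`, matching the alternative 3-server strategy; a file with ID x then lives on server `table[x % len(table)]`.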

Note that you didn't mention whether files are ever removed, or whether you plan to handle that.

Upvotes: 1
