mforrest
mforrest

Reputation: 104

Splitting an Alphabetical List Into Equal Chunks

I've been tasked with splitting a directory of names into four (approximately) equal chunks. The directory is effectively a phonebook that is already alphebatised. The solution has to be generic and work for any directory not just one specific one. If it helps the directory is an array of strings.

For example the four chunks for one directory could be:

A-E, F-L, M-S and T-Z 

Whilst another could be

A-B, C-D, E-F and G-Z

I've already considered just splitting the size of the directory in 4 and then counting upwards until reaching that number and noting the letter that entry starts with but that isn't particularly elegant.

What I mean by this is: take the directory to be 100 entries. I could divide this by 4 to get 25 (how many entries should be in each chunk). Going through the entries to 25 and then taking that entry should give me the last entry in the first chunk. However, this doesn't work when the number of entries for each letter in the alphabet vary greatly. A-J could all have one entry and K could have 32 entries which would make my process useless.

It would be helpful to have pseudocode instead of a specific C implementation but really a point in the right direction would be a great help.

Upvotes: 1

Views: 128295

Answers (2)

ecatmur
ecatmur

Reputation: 157484

This is an optimization problem in three variables; the boundaries between the four chunks. If we denote the boundaries x, y, z with the chunks half-open intervals A-x, x-y, y-z, z-Z then the only further constraint is that x <= y <= z, giving 3276 possibilities for x, y, z which is trivial to search exhaustively.

Then all you need is a way to score one configuration as better or worse than another; I'd suggest using sum of squared error e.g. for chunk lengths 20, 26, 32, 24 the squared error would be (20-25)^2 + (26-25)^2 + (32-25)^2 + (24-25)^2 = 76.

Putting this together, you can write the exhaustive search with nested loops:

best, best_error = Nothing, +Inf
for A <= x <= Z:
    for x <= y <= Z:
        for y <= z <= Z:
            error = (sum(lengths[i] for A < i <= x) - 25)^2
                  + (sum(lengths[i] for x < i <= y) - 25)^2
                  + (sum(lengths[i] for y < i <= z) - 25)^2
                  + (sum(lengths[i] for z < i <= Z) - 25)^2
            if error < best_error:
                 best, best_error = (x, y, z), error

Upvotes: 1

Tamil Maran
Tamil Maran

Reputation: 291

The directory is already sorted. so you can easily split them into four if you consider extra alphabets as keys like (A-Ae, Af-Az etc.) The basic idea is

  1. Store your dictionary in some data structure (say array) in the sorted order
  2. Now divide the length of the array by four and make four indices respectively.
  3. At each index check the letters with the previous index word.
    • like if 1st index word is "Abandon" and second index word is "Ascension" then first key would be "Ab" and second would be "As".
    • So all the words within the range (Ab-As) will be present between the keys.
    • If first two letters are the same like u expect the dictionary to be partially distributed, then go for an additional letter for the key. (like Aba - Abs)

Upvotes: 0

Related Questions