ikro
ikro

Reputation: 21

Store all combinations of String array and search

I have a big static dataset in the memory that stores the following attributes of people:

[sex, age, race, marital-status, education, native-country, workclass, occupation]

Each attribute takes values from a predefined set of values, and set is of different size for each attribute. This is the dictionary:

[[Male, Female], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100], [White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black], [Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse], [Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool], [United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands], [Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked], [Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces]]

I would like to have a structure that keeps all possible combinations, so that for each combination in my dataset I can store some statistics (e.g. how many times a specific combination exists in the dataset), but store some information also for combinations that don't exist in the dataset. So all combinations should be represented.

I tried producing all possible combinations using ArrayList of String[] but it takes several seconds and then searching for a specific combination using indexOf(x), where x is String[] doesn't seem to work.

public class Grid  {

// Immutable fields
private final int combinationLength;
private final String[][] values;
private final int[] maxIndexes;
private final ArrayList<String[]> GridValues = new ArrayList<String[]>();
// Mutable fields
private final int[] currentIndexes;
private boolean hasNext;

public Grid(final String[][] array) {
    combinationLength = array.length;
    values = array;
    maxIndexes = new int[combinationLength];
    currentIndexes = new int[combinationLength];

    if (combinationLength == 0) {
        hasNext = false;
        return;
    }

    hasNext = true;


    // Fill in the arrays of max indexes and current indexes.
    for (int i = 0; i < combinationLength; ++i) {
        if (values[i].length == 0) {
            // Set hasNext to false if at least one of the value-arrays is empty.
            // Stop the loop as the behavior of the iterator is already defined in this case:
            // the iterator will just return no combinations.
            hasNext = false;
            return;
        }

        maxIndexes[i] = values[i].length - 1;
        currentIndexes[i] = 0;
    }

    while (hasNext()){
        String[] nextCombination = next();
        GridValues.add(nextCombination);
    }
}


private boolean hasNext() {
    return hasNext;
}


public String[] next() {
    if (!hasNext) {
        throw new NoSuchElementException("No more combinations are available");
    }
    final String[] combination = getCombinationByCurrentIndexes();
    nextIndexesCombination();
    return combination;
}

private String[] getCombinationByCurrentIndexes() {
    final String[] combination = new String[combinationLength];
    for (int i = 0; i < combinationLength; ++i) {
        combination[i] = values[i][currentIndexes[i]];
    }
    return combination;
}

private void nextIndexesCombination() {

    for (int i = combinationLength - 1; i >= 0; --i) {
        if (currentIndexes[i] < maxIndexes[i]) {
            // Increment the current index
            ++currentIndexes[i];
            return;
        } else {
            // Current index at max: 
            // reset it to zero and "carry" to the next index
            currentIndexes[i] = 0;
        }
    }
    // If we are here, then all current indexes are at max, and there are no more combinations
    hasNext = false;
}
}

Anyone has an idea for a faster and better way to do this?

Thanks a lot!

Upvotes: 2

Views: 296

Answers (1)

Kiran K
Kiran K

Reputation: 768

I am making an assumption here- I am assuming the data does not keep changing (looking at the data does not feel like its dynamic).

I would use an a local file based HSQL DB to store the data (I chose this for speed purposes - however feel free to swap this out for a formal dB like MySQL).

The trick to get all the types counts across various dimensions is in the schema. For data mining "Star Schema" is the preferred approach. This schema will allow you to group by, count on any dimension you want. In your case the schema would probably look like:

table person - columns(id (primary key), name, age, sex_id, country_id, highest_education_id, income)
table sex - columns(id (primary key), name)
table country - columns(id (primary key), name)
table education - columns(id (primary key), name)

This way if you want to find count of all people who are from Columbia, the query would be like:

select count(*) from people where country_id = <columbia country id>

You can do even higher order queries like, find a total income of all the japanese :

select country.name, sum(people.income)
from people inner join country on people.country_id = country.id
and country.name = "Japan"

Its highly flexible and extensible.

Upvotes: 1

Related Questions