AlexMTMorgan
AlexMTMorgan

Reputation: 87

How to remove "vertical duplicate" values from multi-value map?

Unfortunately, I'm working with some less-than-savoury data structures in Java and I have the task of removing what I'm calling "vertical duplicates" in a multi-value map (Map<Enum, List<String>>) in which all values (List<String>) are of the same size

Here's an example of what I mean:

{
    // Column : 1    2    3    4    5    6    7    8    9
    NUMBER : [ "1", "2", "3", "1", "2", "3", "1", "2", "3" ],
    LETTER : [ "A", "B", "C", "A", "E", "F", "G", "B", "I" ],
    SYMBOL : [ "!", "!", "!", "!", "!", "!", "!", "!", "!" ],
    ...
}

A "vertical duplicate" would be any column which has the same values as any previous column. In the above map, the duplicates would be columns [1,4] (both with values 1,A,!), and [2,8] (both with values 2,B,!).

The output from the above map, after removing "vertical duplicates":

{
    // Previous Column:
    //          1    2    3    5    6    7    9
    NUMBER : [ "1", "2", "3", "2", "3", "1", "3" ],
    LETTER : [ "A", "B", "C", "E", "F", "G", "I" ],
    SYMBOL : [ "!", "!", "!", "!", "!", "!", "!" ],
    ...
}

Is there an easy way to remove the "vertical duplicates"? I'm working with multi-value maps with varying key-set sizes (one map may have 3 different enum keys, another may have 17 different enum keys), as well as varying value-set sizes (one map may contain lists each with a size of 2, another may contain lists each with a size of 20).

Upvotes: 0

Views: 531

Answers (2)

Oleg Sklyar
Oleg Sklyar

Reputation: 10082

I would recommend to use a column-based data structure instead of a row-based for these data. At least you can/should use such a structure for this operation and you can add a simple method to turn it back into a row-wise multimap afterwards. This is how it can look like on a full functional example:

public enum Row {
    NUMBER, LETTER, SYMBOL, WHATEVER1, WHATEVER2
}

public static class Col {
    Map<Row, String> col = new HashMap<>();

    public Col(Entry<Row, String>... entries) {
        for (Entry<Row, String> entry: entries) {
            col.put(entry.getKey(), entry.getValue());
        }
    }

    // to use within a LinkedHashSet
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (other == null || getClass() != other.getClass()) return false;
        return Objects.equals(col, ((Col) other).col);
    }

    @Override
    public int hashCode() { return Objects.hash(col); }

    @Override
    public String toString() { return col.toString(); }
}


public static void main(String[] argv) {
    // alternatively use LinkedHashSet directly
    List<Col> cols = new ArrayList<>();
    cols.add(new Col(new SimpleEntry<>(Row.NUMBER, "1"), new SimpleEntry<>(Row.LETTER, "A"), new SimpleEntry<>(Row.WHATEVER1, "X")));
    cols.add(new Col(new SimpleEntry<>(Row.NUMBER, "2"), new SimpleEntry<>(Row.LETTER, "B"), new SimpleEntry<>(Row.SYMBOL, "!")));
    cols.add(new Col(new SimpleEntry<>(Row.NUMBER, "1"), new SimpleEntry<>(Row.LETTER, "A"), new SimpleEntry<>(Row.WHATEVER1, "X")));

    // turn original structure unique keeping order of insertion
    Set<Col> unique = new LinkedHashSet<>(cols);

    System.out.println(unique);
}

Prints

[{LETTER=A, NUMBER=1, WHATEVER1=X}, {LETTER=B, NUMBER=2, SYMBOL=!}]

Upvotes: 2

Karol Dowbecki
Karol Dowbecki

Reputation: 44960

There are different approaches depending on time or space constraints but you can build a histogram for each column value (e.g. Map<String, Integer>) and remove all columns that has a count of 2 or more. This should be more efficient than comparing every column with every other column.

Upvotes: 0

Related Questions