Max
Max

Reputation: 1094

Java SQL String remove duplicate items in GROUP BY

This is a challenge about String manipulation.

The statement below forms the GROUP BY for a SQL statement. I would like to write a method called removeDuplicates() to remove the duplicate items.

E.g.

// Comma in quotes
String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK'";
String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US'";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in quotes with escaped single quotes
String source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
String expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in parentheses
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in parentheses with parentheses
String source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);

// Combined
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);

I initially tried to 1) split the string on comma outside quotes (Splitting on comma outside quotes), 2) make the items unique, 3) then join them together.

However, it doesn't work when to_char(DATE, 'YYYY-MM-DD') appears in the String.

Could anyone come up with something or suggest any library that helps solving this problem? Thanks in advance.

ADDED:

If we don't worry about sub-queries, the difficult part is splitting the criteria into valid elements. Trimming and making them uniqueIgnoreCase is easy to achieve.

For the splitting, I think the combination of the below should cover all the scenarios:

- split by ,
- on each element, ignore checking comma within the first ( and the last )
- on each element, ignore checking comma within the first ' and the last '

Upvotes: 2

Views: 178

Answers (3)

Daniel Centore
Daniel Centore

Reputation: 3349

EDIT: Here is an update. It has been modified to ignore all sections between quotes and all sections between parentheses when searching for commas. It is not guaranteed to work for arbitrary SQL but passes all the situations you've described thus far.

EDIT: Updated code again to ignore close parentheses inside quotes

import java.util.ArrayList;
import java.util.Iterator;

public class Main
{
    private static final String GUID = "f61916a6-3859-4cda-ae2f-209ff3802831";

    public static void main(String args[])
    {
        // Comma in quotes
        String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK', to_char(DATE, '(YYYY)MM,DD'), to_char(DATE, '(YYYY)MM,DD')";
        String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', to_char(DATE, '(YYYY)MM,DD')";
        String result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in quotes with escaped single quotes
        source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
        expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in parentheses
        source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
        expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in parentheses with parentheses
        source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
        expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Combined
        source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
        expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));
    }

    private static String removeDuplicates(String source)
    {
        // Replace escaped quotes with a GUID to make it easier to parse
        source = source.replace("''", GUID);

        source = source + ','; // Hacky way to get the last part to show up

        ArrayList<String> elements = new ArrayList<String>();

        ArrayList<Character> charArray = new ArrayList<Character>();

        for (char c : source.toCharArray())
            charArray.add(c);

        Iterator<Character> itr = charArray.iterator();

        // Identify all the elements
        String thusFar = "";
        while (itr.hasNext())
        {
            char next = itr.next();

            if (next == ',')
            {
                thusFar = thusFar.trim();
                if (!elements.contains(thusFar))
                    elements.add(thusFar);
                thusFar = "";
                continue;
            }

            thusFar += next;

            // Ignore anything inside quotes
            if (next == '\'')
            {
                char c;
                while ((c = itr.next()) != '\'')
                {
                    thusFar += c;
                }
                thusFar += c;
                continue;
            }

            // Ignore anything inside parentheses
            if (next == '(')
            {
                while (true)
                {
                    char c = itr.next();
                    thusFar += c;

                    if (c == ')')
                        break;

                    // Ignore anything inside quotes inside parentheses (including a close paren)
                    if (c == '\'')
                    {
                        char c2 = itr.next();
                        while (c2 != '\'')
                        {
                            thusFar += c2;
                            c2 = itr.next();
                        }
                        thusFar += c2;
                    }
                }

                continue;
            }
        }

        // Combine all the elements back together
        String result = "";

        for (String s : elements)
            result += s + ", ";

        if (result.length() > 2)
        {
            result = result.substring(0, result.length() - 2);
        }

        // Put the escaped quotes back in
        result = result.replace(GUID, "''");

        return result;
    }
}

Upvotes: 2

Maciej Stępyra
Maciej Stępyra

Reputation: 320

If you dont have nested functions You can simply use regexp to tokenize string:

/([a-z_]+\([^\(\)]*?\))|([A-Z_]+)/g

and then remove duplicates. [a-z_]+ matches function name, \([^\(\)]*?\) matches function params - everything exept "(" and ")".The last part ([A-Z]+) matches uppercase field names.

For provided example it will produce something like this:

NAME
to_char(DATE, 'YYYY,MM,DD')
to_char(DATE, 'YYYY-MM-DD')
NAME
CITY
to_char(DATE, 'YYYY-MM-DD')

Upvotes: 0

user5221125
user5221125

Reputation:

Better use a csv library, otherwise commas inside single or double quotes (which could be nested), quotes/commas after escapes, escapes which cancel escapes you need to deal with.
https://commons.apache.org/proper/commons-csv/

Regex can not deal with nested structures. Theoretically not possible.

Upvotes: 1

Related Questions