Reputation: 1094
This is a challenge about String manipulation.
The statement below forms the GROUP BY for a SQL statement. I would like to write a method called removeDuplicates() to remove the duplicate items.
E.g.
// Comma in quotes
String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK'";
String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US'";
String result = removeDuplicates(source);
assert result.equals(expected);
// Comma in quotes with escaped single quotes
String source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
String expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);
// Comma in parentheses
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);
// Comma in parentheses with parentheses
String source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);
// Combined
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);
I initially tried to 1) split the string on comma outside quotes (Splitting on comma outside quotes), 2) make the items unique, 3) then join them together.
However, it doesn't work when to_char(DATE, 'YYYY-MM-DD') appears in the String.
Could anyone come up with something or suggest any library that helps solving this problem? Thanks in advance.
ADDED:
If we don't worry about sub-queries, the difficult part is splitting the criteria into valid elements. Trimming and making them uniqueIgnoreCase is easy to achieve.
For the splitting, I think the combination of the below should cover all the scenarios:
- split by ,
- on each element, ignore checking comma within the first ( and the last )
- on each element, ignore checking comma within the first ' and the last '
Upvotes: 2
Views: 178
Reputation: 3349
EDIT: Here is an update. It has been modified to ignore all sections between quotes and all sections between parentheses when searching for commas. It is not guaranteed to work for arbitrary SQL but passes all the situations you've described thus far.
EDIT: Updated code again to ignore close parentheses inside quotes
import java.util.ArrayList;
import java.util.Iterator;
public class Main
{
private static final String GUID = "f61916a6-3859-4cda-ae2f-209ff3802831";
public static void main(String args[])
{
// Comma in quotes
String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK', to_char(DATE, '(YYYY)MM,DD'), to_char(DATE, '(YYYY)MM,DD')";
String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', to_char(DATE, '(YYYY)MM,DD')";
String result = removeDuplicates(source);
System.out.println(result.equals(expected));
// Comma in quotes with escaped single quotes
source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
result = removeDuplicates(source);
System.out.println(result.equals(expected));
// Comma in parentheses
source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
result = removeDuplicates(source);
System.out.println(result.equals(expected));
// Comma in parentheses with parentheses
source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
result = removeDuplicates(source);
System.out.println(result.equals(expected));
// Combined
source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
result = removeDuplicates(source);
System.out.println(result.equals(expected));
}
private static String removeDuplicates(String source)
{
// Replace escaped quotes with a GUID to make it easier to parse
source = source.replace("''", GUID);
source = source + ','; // Hacky way to get the last part to show up
ArrayList<String> elements = new ArrayList<String>();
ArrayList<Character> charArray = new ArrayList<Character>();
for (char c : source.toCharArray())
charArray.add(c);
Iterator<Character> itr = charArray.iterator();
// Identify all the elements
String thusFar = "";
while (itr.hasNext())
{
char next = itr.next();
if (next == ',')
{
thusFar = thusFar.trim();
if (!elements.contains(thusFar))
elements.add(thusFar);
thusFar = "";
continue;
}
thusFar += next;
// Ignore anything inside quotes
if (next == '\'')
{
char c;
while ((c = itr.next()) != '\'')
{
thusFar += c;
}
thusFar += c;
continue;
}
// Ignore anything inside parentheses
if (next == '(')
{
while (true)
{
char c = itr.next();
thusFar += c;
if (c == ')')
break;
// Ignore anything inside quotes inside parentheses (including a close paren)
if (c == '\'')
{
char c2 = itr.next();
while (c2 != '\'')
{
thusFar += c2;
c2 = itr.next();
}
thusFar += c2;
}
}
continue;
}
}
// Combine all the elements back together
String result = "";
for (String s : elements)
result += s + ", ";
if (result.length() > 2)
{
result = result.substring(0, result.length() - 2);
}
// Put the escaped quotes back in
result = result.replace(GUID, "''");
return result;
}
}
Upvotes: 2
Reputation: 320
If you dont have nested functions You can simply use regexp to tokenize string:
/([a-z_]+\([^\(\)]*?\))|([A-Z_]+)/g
and then remove duplicates. [a-z_]+
matches function name, \([^\(\)]*?\)
matches function params - everything exept "(" and ")".The last part ([A-Z]+)
matches uppercase field names.
For provided example it will produce something like this:
NAME
to_char(DATE, 'YYYY,MM,DD')
to_char(DATE, 'YYYY-MM-DD')
NAME
CITY
to_char(DATE, 'YYYY-MM-DD')
Upvotes: 0
Reputation:
Better use a csv library, otherwise commas inside single or double quotes (which could be nested), quotes/commas after escapes, escapes which cancel escapes you need to deal with.
https://commons.apache.org/proper/commons-csv/
Regex can not deal with nested structures. Theoretically not possible.
Upvotes: 1