After a lot of reading and experimentation, it seems like I want PRIMARY strength for searching, but TERTIARY or IDENTICAL for ordering. Main question: Is that possible to achieve with H2 (or any other DB)? Secondary question: Am I the only one here or would any of you also like the above combination? Some confirmation would be helpful for my sanity. Background: It seems like the collation can only be set at the very beginning when creating the database. So I want to make sure to pick the right one. I am mainly thinking of these use cases (for now): A search field where the user can start typing to filter a table: Here PRIMARY seems the most appropriate, in order to avoid missing any results (user is used to Google...). Although, it would be nice to be able to give the user the option to enable secondary or tertiary collation to do more precise searching. Ordering: When the user clicks a table column to order the contents, TERTIARY/IDENTICAL ordering seems appropriate. That's what I am used to from everyday experience. I read the offical H2 docs here: http://www.h2database.com/html/commands.html#set_collation . and here: http://www.h2database.com/html/datatypes.html#varchar_ignorecase_type Some more related info: Collation STRENGTH and local language relation The test sql (from https://groups.google.com/forum/?fromgroups=#!topic/h2-database/lBksrrcuGdY ): drop all objects; set collation english STRENGTH PRIMARY; create table test(name varchar); insert into test values ('À'), ('Ä'), ('Â'), ('A'), ('à'), ('ä'), ('â'), ('a'), ('àa'), ('äa'), ('âa'), ('aa'), ('B'), ('b'); select * from test where name like 'a' order by name; select * from test order by name;

Reputation: 5808

H2 database collation: what to choose?

After a lot of reading and experimentation, it seems like I want PRIMARY strength for searching, but TERTIARY or IDENTICAL for ordering. Main question: Is that possible to achieve with H2 (or any other DB)?

Secondary question: Am I the only one here or would any of you also like the above combination? Some confirmation would be helpful for my sanity.

Background: It seems like the collation can only be set at the very beginning when creating the database. So I want to make sure to pick the right one. I am mainly thinking of these use cases (for now):

A search field where the user can start typing to filter a table: Here PRIMARY seems the most appropriate, in order to avoid missing any results (user is used to Google...). Although, it would be nice to be able to give the user the option to enable secondary or tertiary collation to do more precise searching.
Ordering: When the user clicks a table column to order the contents, TERTIARY/IDENTICAL ordering seems appropriate. That's what I am used to from everyday experience.

I read the offical H2 docs here: http://www.h2database.com/html/commands.html#set_collation. and here: http://www.h2database.com/html/datatypes.html#varchar_ignorecase_type Some more related info: Collation STRENGTH and local language relation

The test sql (from https://groups.google.com/forum/?fromgroups=#!topic/h2-database/lBksrrcuGdY):

drop all objects;
set collation english STRENGTH PRIMARY;
create table test(name varchar);
insert into test values ('À'), ('Ä'), ('Â'), ('A'), ('à'), ('ä'), ('â'), ('a'), ('àa'), ('äa'), ('âa'), ('aa'), ('B'), ('b');
select * from test where name like 'a' order by name;
select * from test order by name;

enter image description here

Upvotes: 5

Answers (2)

Reto Höhener

Reputation: 5808

Almost 8 years later, my own recommendation based on some hard learnings:

Use no collation at all (default for H2 databases).

Rationale: Using a collation will produce some really unexpected results and bugs.

Pitfall: UNIQUE constraints

By far the most common unique constraints i saw in daily business was to enforce unique (firstname, lastname). Typically, case should be ignored (prevent both 'thomas müller' and 'Thomas Müller'), but not umlauts (allow both 'Thomas Müller' and 'Thomas Muller').

It might be tempting to use a collation strength SECONDARY setting to achieve this (case-insensitive but umlaut-sensitive). Don't. Use VARCHAR_IGNORECASE columns instead.

{
  // NOT recommended: using SECONDARY collation
  Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
  s.execute("SET COLLATION ENGLISH STRENGTH SECONDARY");
  s.execute("CREATE TABLE test ( name VARCHAR )");
  s.execute("ALTER  TABLE test ADD CONSTRAINT unique_name UNIQUE(name)");
  s.execute("INSERT INTO  test (name) VALUES ('Müller')");
  s.execute("INSERT INTO  test (name) VALUES ('Muller')");
  // s.execute("INSERT INTO  test (name) VALUES ('muller')" /* will fail */);
}
{
  // recommended: no collation, using VARCHAR_IGNORECASE instead of VARCHAR column
  Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
  s.execute("CREATE TABLE test ( name VARCHAR_IGNORECASE )");
  s.execute("ALTER  TABLE test ADD CONSTRAINT unique_name UNIQUE(name)");
  s.execute("INSERT INTO  test (name) VALUES ('Müller')");
  s.execute("INSERT INTO  test (name) VALUES ('Muller')");
  // s.execute("INSERT INTO test (name) VALUES ('muller')" /* will fail */);
}

Pitfall: Searching / WHERE clauses

Recommendation: The default behavior without collation is just fine, and behaves as expected. For more fuzzy searching use your own code search or a library like Lucene.

SECONDARY collation strength will match even if case is different. You will not expect that behavior when using SELECT WHERE name = '...', because you will forget all about your collation setting.

{
  Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
  s.execute("SET COLLATION ENGLISH STRENGTH SECONDARY");
  s.execute("CREATE TABLE test ( name VARCHAR )");
  s.execute("INSERT INTO  test (name) VALUES ('Thomas Müller')");
  ResultSet rs = s.executeQuery("SELECT count(*) FROM test WHERE name = 'Thomas müller'" /* different case */);
  rs.next();
  /* prints 1 (!) */ System.out.println(rs.getLong(1));
}

PRIMARY collation strength will match even if SPACES are different. Would you believe the English primary collation ignores spaces? Check out this nugget: https://stackoverflow.com/a/16567963/1124509

{      
  Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
  s.execute("SET COLLATION ENGLISH STRENGTH PRIMARY");
  s.execute("CREATE TABLE test ( name VARCHAR )");
  s.execute("INSERT INTO  test (name) VALUES ('Thomas Müller')");
  ResultSet rs = s.executeQuery("SELECT count(*) FROM test WHERE name = 'ThomasMüller'" /* no space! */);
  rs.next();
  /* prints 1 (!) */ System.out.println(rs.getLong(1));
}

Sorting / ORDER BY clauses

The default ordering without collation is not really useful in real-world scenarios, as it will sort according to strict string comparison. Solve this by first loading the data from the database, and then order / sort it with code.

Personally, I mostly use an English primary strength collator with the spaces problem fixed. Works fine even for non-English text columns.

But you might also need to use a custom comparator to satisfy more difficult requirements like natural or intuitive sort orders, e.g. sort like windows explorer, or semantic versioning.

{
  Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
  s.execute("CREATE TABLE test ( name VARCHAR )");
  s.execute("INSERT INTO  test (name) VALUES ('é6')");
  s.execute("INSERT INTO  test (name) VALUES ('e5')");
  s.execute("INSERT INTO  test (name) VALUES ('E4')");
  s.execute("INSERT INTO  test (name) VALUES ('ä3')");
  s.execute("INSERT INTO  test (name) VALUES ('a2')");
  s.execute("INSERT INTO  test (name) VALUES ('A1')");
  ResultSet rs = s.executeQuery("SELECT name FROM test ORDER BY name");
  List<String> names = new ArrayList<>();
  while(rs.next()) {
    names.add(rs.getString(1));
  }
  // not very useful strict String.compareTo() result: [A1, E4, a2, e5, ä3, é6]
  System.out.print(names);

  String rules = ((RuleBasedCollator) Collator.getInstance(new Locale("en", "US"))).getRules();
  Collator collator = new RuleBasedCollator(rules.replaceAll("<'\u005f'", "<' '<'\u005f'"));
  collator.setStrength(Collator.PRIMARY);
  names.sort((a, b) -> collator.compare(a, b));
  // as humans usually expect it in a name list / table: [A1, a2, ä3, E4, e5, é6]
  System.out.print(names);
}

How to check if your H2 database is using a collation?

Look at the SETTINGS table. If no collation is set, there will be no entry in the table.

Upvotes: 3

LoganMzz

Reputation: 1623

If you want to have two behaviours for a single data you have to:

split data over two columns,
or uses two operator sets.

For your purpose, it is common to store "canonical" representation of a raw data in order to search on canonical form and then sort/display raw data. May be you should use some "text search engine" such as Apache Lucene.

For pure H2 solutions, you can use H2 alias with Computed columns or with query criteria. First solution allows indexing to speed up your queries.

Upvotes: 2