bsem
bsem

Reputation: 185

Case insensitive sorting of an array of strings

Basically, I have to use selection sort to sort a string[]. I have done this part but this is what I am having difficulty with.

The sort, however, should be case-insensitive, so that "antenna" would come before "Jupiter". ASCII sorts from uppercase to lowercase, so would there not be a way to just swap the order of the sorted string? Or is there a simpler solution?

void stringSort(string array[], int size) {
    int startScan, minIndex;
    string minValue;

    for(startScan = 0 ; startScan < (size - 1); startScan++) {
        minIndex = startScan;
        minValue = array[startScan];

        for (int index = startScan + 1; index < size; index++) {
            if (array[index] < minValue) {
                minValue = array[index];
                minIndex = index;
            }
        }
        array[minIndex] = array[startScan];
        array[startScan] = minValue;
    }
}

Upvotes: 9

Views: 12922

Answers (6)

Peter Cordes
Peter Cordes

Reputation: 364160

Instead of the < operator, use a case-insensitive string comparison function.

C89/C99 provide strcoll (string collate), which does a locale-aware string comparison. It's available in C++ as std::strcoll. In some (most?) locales, like en_CA.UTF-8, A and a (and all accented forms of either) are in the same equivalence class. I think strcoll only compares within an equivalence class as a tiebreak if the whole string is otherwise equal, which gives a very similar sort order to a case-insensitive compare. Collation (at least in English locales on GNU/Linux) ignores some characters (like [). So ls /usr/share | sort gives output like

acpi-support
adduser
ADM_scripts
aglfn
aisleriot

I pipe through sort because ls does its own sorting, which isn't quite the same as sort's locale-based sorting.

If you want to sort some user-input arbitrary strings into an order that the user will see directly, locale-aware string comparison is usually what you want. Strings that differ only in case or accents won't compare equal, so this won't work if you were using a stable sort and depending on case-differing strings to compare equal, but otherwise you get nice results. Depending on the use-case, nicer than plain case-insensitive comparison.

FreeBSD's strcoll was and maybe still is case sensitive for locales other than POSIX (ASCII). That forum post suggests that on most other systems it is not case senstive.

MSVC provides a _stricoll for case-insensitive collation, implying that its normal strcoll is case sensitive. However, this might just mean that the fallback to comparing within an equivalence class doesn't happen. Maybe someone can test the following example with MSVC.


// strcoll.c: show that these strings sort in a different order, depending on locale
#include <stdio.h>
#include <locale.h>

int main()
{
    // TODO: try some strings containing characters like '[' that strcoll ignores completely.
    const char * s[] = { "FooBar - abc", "Foobar - bcd", "FooBar - cde" };
#ifdef USE_LOCALE
    setlocale(LC_ALL, ""); // empty string means look at env vars
#endif
    strcoll(s[0], s[1]);
    strcoll(s[0], s[2]);
    strcoll(s[1], s[2]);
    return 0;
}

output of gcc -DUSE_LOCALE -Og strcoll.c && ltrace ./a.out (or run LANG=C ltrace a.out):

__libc_start_main(0x400586, 1, ...
setlocale(LC_ALL, "")                        = "en_CA.UTF-8"   # my env contains LANG=en_CA.UTF-8
strcoll("FooBar - abc", "Foobar - bcd")      = -1
strcoll("FooBar - abc", "FooBar - cde")      = -2
strcoll("Foobar - bcd", "FooBar - cde")      = -1
  # the three strings are in order
+++ exited (status 0) +++

with gcc -Og -UUSE_LOCALE strcoll.c && ltrace ./a.out:

__libc_start_main(0x400536, ...
# no setlocale, so current locale is C
strcoll("FooBar - abc", "Foobar - bcd")      = -32
strcoll("FooBar - abc", "FooBar - cde")      = -2
strcoll("Foobar - bcd", "FooBar - cde")      = 32   # s[1] should sort after s[2], so it's out of order
+++ exited (status 0) +++

POSIX.1-2001 provides strcasecmp. The POSIX spec says the results are "unspecified" for locales other than plain-ASCII, though, so I'm not sure whether common implementations handle utf-8 correctly or not.

See this post for portability issues with strcasecmp, e.g. to Windows. See other answers on that question for other C++ ways of doing case-insensitive string compares.


Once you have a case-insensitive comparison function, you can use it with other sort algorithms, like C standard lib qsort, or c++ std::sort, instead of writing your own O(n^2) selection-sort.


As b.buchhold's answer points out, doing a case-insensitive comparison on the fly might be slower than converting everything to lowercase once, and sorting an array of indices. The lowercase-version of each strings is needed many times. std::strxfrm will transform a string so that strcmp on the result will give the same result as strcoll on the original string.

Upvotes: 1

Jonathan Mee
Jonathan Mee

Reputation: 38919

C++ provides you with sort which takes a comparison function. In your case with a vector<string> you'll be comparing two strings. The comparison function should return true if the first argument is smaller.

For our comparison function we'll want to find the first mismatched character between the strings after tolower has been applied. To do this we can use mismatch which takes a comparator between two characters returning true as long as they are equal:

const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(), [](const unsigned char lhs, const unsigned char rhs){return tolower(lhs) == tolower(rhs);});

To decide if the lhs is smaller than the rhs fed to mismatch we need to test 3 things:

  1. Were the strings of unequal length
  2. Was string lhs shorter
  3. Or was the first mismatched char from lhs smaller than the first mismatched char from rhs

This evaluation can be performed by:

result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second));

Ultimately, we'll want to wrap this up in a lambda and plug it back into sort as our comparator:

sort(foo.begin(), foo.end(), [](const unsigned char lhs, const unsigned char rhs){
    const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(), [](const unsigned char lhs, const unsigned char rhs){return tolower(lhs) == tolower(rhs);});

    return result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second));
});

This will correctly sort vector<string> foo. You can see a live example here: http://ideone.com/BVgyD2

EDIT:

Just saw your question update. You can use sort with string array[] as well. You'll just need to call it like this: sort(array, std::next(array, size), ...

Upvotes: 9

Patricio Rossi
Patricio Rossi

Reputation: 163

I use this lambda function to sort a vectors of strings:

std::sort(entries.begin(), entries.end(), [](const std::string& a, const std::string& b) -> bool {
        for (size_t c = 0; c < a.size() and c < b.size(); c++) {
            if (std::tolower(a[c]) != std::tolower(b[c]))
                return (std::tolower(a[c]) < std::tolower(b[c]));
        }
        return a.size() < b.size();
    });

Upvotes: 2

Norden
Norden

Reputation: 31

#include <algorithm>
#include <vector>
#include <string>

using namespace std;    

void CaseInsensitiveSort(vector<string>& strs)
{
    sort(
        begin(strs),
        end(strs),
        [](const string& str1, const string& str2){
            return lexicographical_compare(
                begin(str1), end(str1),
                begin(str2), end(str2),
                [](const char& char1, const char& char2) {
                    return tolower(char1) < tolower(char2);
                }
            );
        }
    );
}

Upvotes: 3

Slava
Slava

Reputation: 44248

This solution is much simpler to understand than Jonathan Mee's and pretty inefficient, but for educational purpose could be fine:

std::string lowercase( std::string s )
{
   std::transform( s.begin(), s.end(), s.begin(), ::tolower );
   return s;
}

std::sort( array, array + length, 
           []( const std::string &s1, const std::string &s2 ) { 
               return lowercase( s1 ) < lowercase( s2 ); 
           } );

if you have to use your sort function, you can use the same approach:

    ....
    minValue = lowercase( array[startScan] );

    for (int index = startScan + 1; index < size; index++) {
        const std::string &tstr = lowercase( array[index] );
        if (tstr < minValue) {
            minValue = tstr;
            minIndex = index;
        }
    }
    ...

Upvotes: 0

b.buchhold
b.buchhold

Reputation: 3896

You could call tolower on every character you compare. This is probably the easiest, yet not a great solution, becasue:

  • You look at every char multiple times so you'd call the method more often than necessary
  • You need extra care to handle wide-characters w.r.t to their encoding (UTF8 etc)

You could also replace the comparator by your own function. I.e. there will be some place where you compare something like stringone[i] < stringtwo[j] or charA < charB. change it to my_less_than(stringone[i], stringtwo[j]) and implement the exact ordering you want based.

another way would be to transform every string to lowercase once and create an array of pairs. then you base your comparisons on the lowercase value only, but you swap whole pairs so that your final strings will be in the right order as well.

finally, you can create an array with lowercase versions and sort this one. whenever you swap two elements in this one, you also swap in the original array.

note that all those proposals would still need proper handling of wide characters (if you need that at all)

Upvotes: 0

Related Questions