shkra19
shkra19

Reputation: 309

Equality of two strings

What is the easiest way, with the least amount of code, to compare two strings, while ignoring the following:

"hello  world" == "hello world"                   // spaces
"hello-world"  == "hello world"                   // hyphens
"Hello World"  == "hello worlD"                   // case
"St pierre"    == "saint pierre" == "St. Pierre"  // word replacement

I'm sure this has been done before, and there are some libraries to do this kind of stuff, but I don't know any. This is in C++ preferably, but if there's a very short option in whatever other language, I'll want to hear about it too.

Alternatively, I'd also be interested in any library that could give a percentage of matching. Say, hello-world and hello wolrd are 97% likely to be the same meaning, just a hyphen and a mispelling.

Upvotes: 0

Views: 309

Answers (5)

user405725
user405725

Reputation:

  1. Remove spaces from both strings.
  2. Remove hyphens from both strings.
  3. Convert both strings to lower case.
  4. Convert all occurrences of “saint” and “st.” to “st”.
  5. Compare strings like normal.

For example:

#include <cctype>
#include <string>
#include <algorithm>
#include <iostream>

static void remove_spaces_and_hyphens(std::string &s)
{
    s.erase(std::remove_if(s.begin(), s.end(), [](char c) {
                return c == ' ' || c == '-';
            }), s.end());
}

static void convert_to_lower_case(std::string &s)
{
    for (auto &c : s)
        c = std::tolower(c);
}

static void
replace_word(std::string &s, const std::string &from, const std::string &to)
{
    size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();
    }
}

static void replace_words(std::string &s)
{
    replace_word(s, "saint", "st");
    replace_word(s, "st.", "st");
}

int main()
{
    // Given two strings:
    std::string s1 = "Hello, Saint   Pierre!";
    std::string s2 = "hELlO,St.PiERRe!";

    // Remove spaces and hyphens.
    remove_spaces_and_hyphens(s1);
    remove_spaces_and_hyphens(s2);

    // Convert to lower case.
    convert_to_lower_case(s1);
    convert_to_lower_case(s2);

    // Replace words...
    replace_words(s1);
    replace_words(s2);

    // Compare.
    std::cout << (s1 == s2 ? "Equal" : "Doesn't look like equal") << std::endl;
}

There is a way, of course, to code this more efficiently, but I recommend you start with something working and optimize it only when it proves to be a bottleneck.

It also sounds like you might be interested in string similarity algorithms like “Levenshtein distance”. Similar algorithms are used, for example, by search engine or editors to offer suggestion on spell correction.

Upvotes: 2

Freeman Zhang
Freeman Zhang

Reputation: 175

for the first 3 requirments,

  1. remove all spaces/hypens of string (or replace it to a char, e.g'') "hello world" --> "helloworld"
  2. compare them ignore case. Case insensitive string comparison in C++

for the last requirment, it is more compliate.
first you need a dictionary, which in KV structure:
'St.': 'saint'
'Mr.': 'mister'

second use boost token to seperate the string, and fetch then in the KV Store
then replace the token to the string, but it may in low performance:

http://www.boost.org/doc/libs/1_53_0/libs/tokenizer/tokenizer.htm

Upvotes: 0

Cameron Tinker
Cameron Tinker

Reputation: 9789

For spaces and hyphens, just replace all spaces/hyphens in the string and do a comparison. For case, convert all text to upper or lower case and do the comparison. For word replacement, you would need a dictionary of words with the key being the abbreviation and the value being the replacement word. You may also consider using the Levenshtein Distance algorithm for showing how similar one phrase is to another. If you want statistical probablility of how close a word/phrase is to another word/phrase, you will need sample data to do a comparison.

Upvotes: 1

Martin Perry
Martin Perry

Reputation: 9527

I dont know any library, but for equlity, if speed is not rpoblem, you can do char-by-char compare and ignore "special" characters (respectively move iterator further in text).

As for comparing texts, you can use simple Levenshtein distance.

Upvotes: 1

phyatt
phyatt

Reputation: 19152

QRegExp is what you are looking for. It won't print out the percentages, but you can make some pretty slick ways of comparing one string to another, and finding the number of matches of one string to another.

Regular Expressions are available with almost ever language out there. I like GSkinner's RegEx page for learning regular expressions.

http://qt-project.org/doc/qt-4.8/qregexp.html

Hope that helps.

Upvotes: 0

Related Questions