Reputation: 3196

string abbreviation creating duplicates

I'm trying to use abbreviate to come up with short, unique abbreviations, but it's returning some unexpected values. If I run:

abbreviate(c('moscowcity', 'ms'), minlength = 2)
moscowcity         ms 
    "msc"       "ms"

it returns "msc" instead of a simpler two-letter abbreviation such as "mo" or "mc" or "mt" or "my".

If I change to strict = TRUE it returns duplicates.

Is there any way to get both as two letter abbreviations that are also unique?

Upvotes: 1

Answers (5)

Rafael

Reputation: 3196

Reading through the answers it looks like abbreviate wasn't really built for what I wanted. Taking @akrun's suggestion I wrote a wrapper function to create unique abbreviations.

Improvements welcome!!

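btrAbbreviate = function(x, maxlen) {
  x = tolower(x)

  # strict = TRUE enforces the length but may produce duplicates
  res = abbreviate(x, minlength = maxlen, strict = TRUE)
  dups = res[duplicated(res)]
  dupsChk = length(dups)

  # Re-draw every duplicate: keep its first letter, shuffle the rest,
  # and truncate to maxlen. Repeat until all abbreviations are unique.
  # Note: this never terminates if no unique assignment exists.
  while (dupsChk != 0) {
    firstChar = stringr::str_sub(names(dups), 1, 1)

    shfl = stringi::stri_rand_shuffle(substring(names(dups), 2))
    shfl = paste0(firstChar, shfl)

    # truncate to maxlen (was hard-coded to 2)
    out = stringr::str_sub(shfl, 1, maxlen)
    names(out) = names(dups)

    res[names(out)] = out

    dups = res[duplicated(res)]
    dupsChk = length(dups)
  }

  return(res)
}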

x = state.name
btrAbbreviate(x, maxlen = 2)

Upvotes: 1

iod

Reputation: 7592

From the documentation:

The default algorithm (method = "left.kept") used is similar to that of S. For a single string it works as follows. First spaces at the ends of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed followed by lower case consonants.

In other words, the algorithm removes vowels first (thus precluding mo), then consonants, and falls back to a longer abbreviation when a duplicate is created. To achieve what you're suggesting, which is a very complicated thing to do (look into the history of US postal state name abbreviations!), you'll have to create your own algorithm.
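As a rough idea of what such a custom algorithm could look like, here is a minimal sketch of a greedy approach. The function name greedy_abbrev2 is hypothetical (it is not part of base R or any package), and the strategy is only one assumption among many possible ones: pair the first letter with each later letter in order and take the first code that is still free.

```r
# Hypothetical helper, not part of base R: greedy two-letter codes.
greedy_abbrev2 <- function(x) {
  res <- setNames(rep(NA_character_, length(x)), x)
  for (k in seq_along(x)) {
    # Lower-case the name and drop anything that is not a letter.
    chars <- strsplit(tolower(gsub("[^[:alpha:]]", "", x[k])), "")[[1]]
    if (length(chars) < 2) next               # too short for a two-letter code
    # Candidates: first letter paired with each later letter, in order.
    cand <- unique(paste0(chars[1], chars[-1]))
    free <- setdiff(cand, res)                # codes not handed out yet
    if (length(free) > 0) res[k] <- free[1]   # stays NA if all are taken
  }
  res
}

greedy_abbrev2(c("moscowcity", "ms"))
# moscowcity = "mo", ms = "ms"
```

The result depends on input order, and a name is left as NA when all of its candidates are already taken, so this is a starting point rather than a complete solution.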

Upvotes: 1

dario

Reputation: 6483

I think the answer to your question

Is there any way to get both as two letter abbreviations that are also unique?

is: No, at least not with base R abbreviate

From ?abbreviate:

The default algorithm (method = "left.kept") used is similar to that of S. For a single string it works as follows. First spaces at the ends of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters and symbols are stripped.

Characters are always stripped from the end of the strings first. If an element of names.arg contains more than one word (words are separated by spaces) then at least one letter from each word will be retained.

As I understand it, this means that you would never get a string like mc from moscowcity, because the c will already have been stripped away by the time the algorithm tries ms (which is then flagged as not unique, so the last unique value, msc, is used).
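To make the stripping order concrete, here is a small sketch that mimics the documented steps for a single lower-case word without spaces. trace_abbrev is a hypothetical helper written for illustration only; it is not R's actual implementation and ignores spaces, upper-case letters and symbols.

```r
# Hypothetical helper: imitates the documented "left.kept" order for ONE
# lower-case word without spaces and records every intermediate string.
trace_abbrev <- function(word, minlength) {
  chars <- strsplit(word, "")[[1]]
  steps <- word
  # Pass 1 strips lower-case vowels, pass 2 any remaining lower-case letters.
  for (pattern in c("[aeiou]", "[a-z]")) {
    i <- length(chars)
    # Walk from the end of the word; the first letter is never dropped.
    while (i > 1 && length(chars) > minlength) {
      if (grepl(pattern, chars[i])) {
        chars <- chars[-i]
        steps <- c(steps, paste(chars, collapse = ""))
      }
      i <- i - 1
    }
  }
  steps
}

trace_abbrev("moscowcity", 2)
# "moscowcity" "moscowcty" "moscwcty" "mscwcty" "mscwct"
# "mscwc" "mscw" "msc" "ms"
```

Neither mo nor mc shows up among the intermediate strings, and msc is the last step before the clash with ms.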

Edit:

But: Because of the 'multiple word rule'

abbreviate(c("moscow city", "ms"), minlength = 2)

Returns:

moscow city          ms 
       "mc"        "ms"

Upvotes: 1

nickjaybird

Reputation: 108

minlength is supposed to be an integer, not a boolean. minlength defaults to 4, and you have set it to TRUE. Thus, minlength is telling it to always have at least 4 characters. You would need to do minlength=2 instead.

(edit: the original question had minlength=TRUE but has been changed)

https://stat.ethz.ch/R-manual/R-devel/library/base/html/abbreviate.html

Upvotes: 0

akrun

Reputation: 887571

If we set minlength = 2 with strict = TRUE and wrap the result in make.unique, it prevents the duplicates:

make.unique(abbreviate(c('moscowcity', 'ms'), minlength = 2, strict = TRUE))
#[1] "ms"   "ms.1"

Here, abbreviate is applied to each element separately, without cross-checking whether the same abbreviation has already been allocated.
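For comparison, a short sketch of the three variants discussed in this thread, run on the inputs from the question. The variable names are only illustrative.

```r
x <- c("moscowcity", "ms")

# strict = TRUE: length is enforced, uniqueness is not.
strict_abbr <- abbreviate(x, minlength = 2, strict = TRUE)
unname(strict_abbr)                 # "ms" "ms"

# Default (strict = FALSE): uniqueness is enforced by lengthening.
loose_abbr <- abbreviate(x, minlength = 2)
unname(loose_abbr)                  # "msc" "ms"

# make.unique: uniqueness is enforced by appending a numeric suffix.
unique_abbr <- make.unique(unname(strict_abbr))
unique_abbr                         # "ms" "ms.1"
```

Note that the make.unique result is unique but no longer two characters long, so none of the three variants gives two-letter codes that are also unique.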

Upvotes: 1
