Reputation: 3196
I'm trying to use abbreviate to come up with short unique abbreviations, but it's returning some unexpected values. If I run:
abbreviate(c('moscowcity', 'ms'), minlength = 2)
moscowcity ms
"msc" "ms"
it returns "msc" instead of a simpler two-letter abbreviation such as "mo", "mc", "mt", or "my".
If I change to strict = TRUE, it returns duplicates.
Is there any way to get both as two letter abbreviations that are also unique?
Upvotes: 1
Views: 163
Reputation: 3196
Reading through the answers, it looks like abbreviate wasn't really built for what I wanted. Taking @akrun's suggestion, I wrote a wrapper function to create unique abbreviations.
Improvements welcome!!
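btrAbbreviate = function(x, maxlen) {
  x = tolower(x)
  # strict = TRUE enforces minlength but may return duplicated abbreviations
  res = abbreviate(x, minlength = maxlen, strict = TRUE)
  dups = res[duplicated(res)]
  dupsChk = length(dups)
  # Re-draw the duplicated abbreviations at random until all are unique
  # (may loop forever if no unique assignment exists)
  while (dupsChk != 0) {
    # Keep the first letter and shuffle the rest of the original name
    firstChar = stringr::str_sub(names(dups), 1, 1)
    shfl = stringi::stri_rand_shuffle(substring(names(dups), 2))
    shfl = paste0(firstChar, shfl)
    # Truncate to maxlen characters
    out = stringr::str_sub(shfl, 1, maxlen)
    names(out) = names(dups)
    res[names(out)] = out
    dups = res[duplicated(res)]
    dupsChk = length(dups)
  }
  return(res)
}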
x = state.name
btrAbbreviate(x, maxlen = 2)
Upvotes: 1
Reputation: 7592
From the documentation:
The default algorithm (method = "left.kept") used is similar to that of S. For a single string it works as follows. First spaces at the ends of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed followed by lower case consonants.
In other words, the algorithm starts by removing vowels (thus precluding mo), then consonants, stopping once a duplicate is created. To achieve what you're suggesting, which is a very complicated thing to do (look into the history of US postal state name abbreviations!), you'll have to create your own algorithm.
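As a rough illustration of what such a custom algorithm could look like, here is a minimal sketch. uniqueAbbrev2 is a hypothetical helper, not part of base R or any package. It assumes plain ASCII names, keeps the first letter, and greedily pairs it with the first later letter that gives an unused code, handling the shortest names first.

```r
# Hypothetical helper (an assumption for illustration, not a base R function).
# Builds two-letter codes: first letter + the first later letter that is still free.
uniqueAbbrev2 <- function(x) {
  x <- tolower(x)
  res <- setNames(rep(NA_character_, length(x)), x)
  used <- character(0)
  # Shortest names first, since they have the fewest candidate codes
  for (i in order(nchar(x))) {
    # Keep letters only, so spaces and punctuation never end up in a code
    chars <- strsplit(gsub("[^a-z]", "", x[i]), "")[[1]]
    if (length(chars) < 2) next  # too short for a two-letter code; stays NA
    candidates <- unique(paste0(chars[1], chars[-1]))
    free <- setdiff(candidates, used)
    if (length(free) > 0) {
      res[i] <- free[1]
      used <- c(used, free[1])
    }
  }
  res
}

uniqueAbbrev2(c("moscowcity", "ms"))
```

Because the search is greedy, it can still leave an NA when every candidate for a name is already taken, so it is worth checking the result with anyNA() before relying on it.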
Upvotes: 1
Reputation: 6483
I think the answer to your question
Is there any way to get both as two letter abbreviations that are also unique?
is: No, at least not with base R's abbreviate.
From ?abbreviate (emphasis mine):
The default algorithm (method = "left.kept") used is similar to that of S. For a single string it works as follows. First spaces at the ends of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters and symbols are stripped.
Characters are always stripped from the end of the strings first. If an element of names.arg contains more than one word (words are separated by spaces) then at least one letter from each word will be retained.
As I understand it, this means that you would never get a string like mc from moscowcity, because the c will already have been stripped away by the time the algorithm tries ms (which is then flagged as not unique, so the last unique value, msc, is used).
But: Because of the 'multiple word rule'
abbreviate(c("moscow city", "ms"), minlength = 2)
Returns:
moscow city ms
"mc" "ms"
Upvotes: 1
Reputation: 108
minlength is supposed to be an integer, not a boolean. minlength defaults to 4, and you have set it to TRUE. Thus, minlength is telling it to always have at least 4 characters. You would need to do minlength=2 instead.
(edit: the original question had minlength=TRUE but has been changed)
https://stat.ethz.ch/R-manual/R-devel/library/base/html/abbreviate.html
Upvotes: 0
Reputation: 887571
If we change the minlength and wrap with make.unique, it would prevent the duplicates:
make.unique(abbreviate(c('moscowcity', 'ms'), minlength = 2, strict = TRUE))
#[1] "ms" "ms.1"
Here, abbreviate is applied to each element separately and does not cross-check whether it has already allocated the same abbreviation.
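Note the trade-off in this approach: make.unique resolves the clash by appending a suffix, so the result is unique but no longer two letters long. A small check along these lines (the variable names are only for illustration) makes that visible:

```r
x <- c("moscowcity", "ms")

# strict = TRUE keeps both abbreviations at two letters, but they collide
strict <- abbreviate(x, minlength = 2, strict = TRUE)
anyDuplicated(strict) > 0   # TRUE when at least one abbreviation is repeated

# make.unique removes the collision by adding a numeric suffix
fixed <- make.unique(strict)
anyDuplicated(fixed) > 0    # FALSE once the duplicates are resolved
nchar(fixed)                # the suffixed entry is longer than two characters
```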
Upvotes: 1