pauljohn32

Reputation: 2255

gsub replace and preserve case

I've been using gsub to abbreviate words in longer strings. I'd like to abbreviate a word and then inherit as much of the capitalization of the input as I can.

For example, turn "hello" into "hi" in this:

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")

But respect the case of "hello" in the original:

c("Hi World", "HI WORLD", "hi world", "hI world")

Most of the examples I really want to match are "HI" "hi" and "Hi". I don't care so much about "hI", but for completeness, I leave that as a possibility.

Until now, I have taken the tedious approach of maintaining vectors of target and replacement strings:

xin <- c("Hello ", "HELLO ", "hello ", "hElLo ")
xout <- c("Hi ", "HI ", "hi ", "hI ")
mapply(gsub, xin, xout, x)

That gives a correct answer, see:

     Hello      HELLO      hello      hElLo
"Hi World" "HI WORLD" "hi world" "hI world"

But this is embarrassing and time consuming and inflexible! So far, I have a family of 50 words for which we seek abbreviation, and keeping all of the case combinations is tiresome.

The data is full of mixed-case chaos because humans typed in about 78000 records and they capitalized words like department and university in every conceivable way. The long sentences they typed don't fit in the space allowed on the printed page, and we are asked to shorten them to "dept" and "univ". We want to preserve the capitalization if possible.

The only idea I have doesn't look much like R to me: split the original input and tabulate the existing capitalization of the first 2 letters.

xcap <- sapply(strsplit(x, split = ""), function(x) x %in% LETTERS)[1:2, ]
t(xcap)
      [,1]  [,2]
[1,]  TRUE FALSE
[2,]  TRUE  TRUE
[3,] FALSE FALSE
[4,] FALSE  TRUE

I'm pretty sure I could use that capitalization information to make this work right. But I haven't yet succeeded. I've just become aware of G Grothendieck's package gsubfn which might work, but the terminology there ("proto" objects) is new to me.
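
A minimal sketch of how that capitalization information might be used, assuming the mask for the first two letters is simply imposed on the replacement "hi". The helper name apply_case is made up for illustration, and a mask shorter than the replacement is padded by repeating its last value, which is an assumption rather than a requirement stated above:

```r
# Hypothetical helper: impose an upper-case mask on a replacement string.
# mask: logical vector, TRUE where the original letter was upper case.
apply_case <- function(mask, repl) {
  chars <- strsplit(repl, "", fixed = TRUE)[[1]]
  # Pad the mask with its last value, then cut it to the replacement length.
  mask <- c(mask, rep(tail(mask, 1), length(chars)))[seq_along(chars)]
  paste(ifelse(mask, toupper(chars), tolower(chars)), collapse = "")
}

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")
xcap <- sapply(strsplit(x, split = ""), function(ch) ch %in% LETTERS)[1:2, ]

# One case-adjusted replacement per input string (columns of xcap).
hi <- apply(xcap, 2, apply_case, repl = "hi")
hi
# "Hi" "HI" "hi" "hI"

# Substitute each string with its own replacement.
res <- mapply(function(s, r) sub("hello", r, s, ignore.case = TRUE), x, hi)
unname(res)
# "Hi World" "HI WORLD" "hi world" "hI world"
```

This still reads the mask from the start of each string, so it only helps when the target word comes first.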

I'll keep going in that direction, probably, but am asking now if there is a more direct route.

pj

Upvotes: 2

Views: 415

Answers (2)

pauljohn32

Reputation: 2255

I tried to post this as a comment above, but it exceeded the word limit. Is it OK to start a new answer?

Here's the solution we are using. It takes the idea that @vck proposed and wraps it in functions that tidy up the input and output. It still feels a bit kludgy to me, but the top priority was getting something that works in a way we can understand. The gsubfn-based avenues were not something we could understand.

##' abbreviate words within strings, but preserve case of input
##'
##' Problem described at
##' http://stackoverflow.com/questions/32304688/gsub-replace-and-preserve-case
##' Please notify me of examples that fail
##' @param y vector of strings in which words are to be abbreviated
##' @param old target words (regular expressions) to be replaced
##' @param new replacements for target words.  must match old
##' vector length.
##' @return vector of abbreviated words 
##' @author Paul Johnson <pauljohn@@ku.edu>
stabbr <- function(y = NULL, old = NULL, new = NULL){
    stopifnot(length(old) == length(new))
    transfwrap <- function(xxin, xxout, xx){
        sapply(xx, transf, xin = xxin, xout = xxout)
    }

    transf <- function(x, xin, xout) {
        xin <- tolower(xin)
        xcap <- (unlist(strsplit(unlist(strsplit(x," "))[1],"")) %in% LETTERS)
        n <- nchar(xout)
        if(length(xcap) >= n) {
            xcap<-xcap[1:n]
        } else {
            xcap <- c(xcap, rep(tail(xcap,1), n-length(xcap)))
        }
        xout2 <- paste(sapply(1:n,function(x) {
            if (xcap[x]) toupper(unlist(strsplit(xout,""))[x])
            else unlist(strsplit(xout,""))[x]
        }), sep = "", collapse = "")
        gsub(xin, xout2, x[1], ignore.case = TRUE)
    }

    for (i in seq_along(old)){
        y <- transfwrap(old[i], new[i], y)
    }
    y
}

Example usages:

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")
xin <- c("Hello", "world")
xout <- c("hi", "wrld")
stabbr(x, xin, xout)

## Hello World HELLO WORLD hello world hElLo world 
##   "Hi Wrld"   "HI WRLD"   "hi wrld"   "hI wRLD" 
x <- c("Department of Ornithology", "DEPARTMENT of ORNITHOLOGY",
       "Dept of Ornith")
xin <- c("Department", "Ornithology")
xout <- c("Dept", "Orni")
res <- stabbr(x, xin, xout)
cbind(x, res)

##                           x                           res
## Department of Ornithology "Department of Ornithology" "Dept of Orni"
## DEPARTMENT of ORNITHOLOGY "DEPARTMENT of ORNITHOLOGY" "DEPT of ORNI"
## Dept of Ornith            "Dept of Ornith"            "Dept of Ornith"

## Tolerates regular expressions.
## Suppose you want to change Department only at first word?
x <- c("Department of Ornithology", "DEPARTMENT of ORNITHOLOGY",
       "Dept of Ornith", "Ornithology Department")
## Aiming here for Department only as first word
xin <- c("^Department", " Ornithology")
xout <- c("Dept", " Orni")
res <- stabbr(x, xin, xout)
res

A nice side effect of this approach is that the output is a named vector that uses the input strings as names:

##    Department of Ornithology DEPARTMENT of ORNITHOLOGY  
##           "Dept of Orni"            "DEPT of ORNI" 
##
##           Dept of Ornith    Ornithology Department 
##          "Dept of Ornith"  "Ornithology Department" 
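
One thing to watch for: transf reads the case pattern from the first word of each string, not from the word being replaced, which is why "hElLo world" came out as "hI wRLD" above. A possible variation, sketched here on the assumption that the case should come from each matched word itself, uses base R's gregexpr() and `regmatches<-`. The names transfer_case and abbr_keep_case are made up for illustration, and this has not been tried on the full 78000 records:

```r
# Hypothetical helper: copy the upper/lower pattern of `word` onto `repl`.
transfer_case <- function(word, repl) {
  w <- strsplit(word, "", fixed = TRUE)[[1]]
  r <- strsplit(repl, "", fixed = TRUE)[[1]]
  up <- w != tolower(w)  # TRUE where `word` has an upper-case letter
  # Pad with the last flag, then cut to the length of the replacement.
  up <- c(up, rep(tail(up, 1), length(r)))[seq_along(r)]
  paste(ifelse(up, toupper(r), tolower(r)), collapse = "")
}

# Hypothetical wrapper: replace every match of each pattern in `old`,
# taking the case from the matched text rather than from the first word.
abbr_keep_case <- function(x, old, new) {
  stopifnot(length(old) == length(new))
  for (i in seq_along(old)) {
    m <- gregexpr(old[i], x, ignore.case = TRUE)
    regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
      vapply(hits, transfer_case, character(1), repl = new[i],
             USE.NAMES = FALSE)
    })
  }
  x
}

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")
out <- abbr_keep_case(x, c("hello", "world"), c("hi", "wrld"))
out
# "Hi Wrld" "HI WRLD" "hi wrld" "hI wrld"
```

Unlike stabbr, this returns an unnamed vector, and it lower-cases replacement letters wherever the matched letter was lower case.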

Upvotes: 0

vck

Reputation: 837

Your idea inspired me to write this code. It's done in one sapply block. The toupper function is used to capitalize the split characters of the xout string.

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")

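sapply(x, function(x, xout) {
  # Case pattern (TRUE = upper case) of the first word of the input
  xcap <- unlist(strsplit(unlist(strsplit(x, " "))[1], "")) %in% LETTERS
  n <- nchar(xout)
  if (length(xcap) >= n) {
    xcap <- xcap[1:n]
  } else {
    # Replacement is longer than the word: repeat the last case flag
    xcap <- c(xcap, rep(tail(xcap, 1), n - length(xcap)))
  }
  # Upper-case the replacement characters flagged in xcap
  xout <- paste(sapply(1:n, function(i) {
    if (xcap[i]) toupper(unlist(strsplit(xout, ""))[i])
    else unlist(strsplit(xout, ""))[i]
  }), sep = "", collapse = "")
  xin <- "hello"
  gsub(xin, xout, x[1], ignore.case = TRUE)
}, xout = "selamlar")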

[output with "selamlar"]
 Hello World      HELLO WORLD      hello world      hElLo world 
"Selamlar World" "SELAMLAR WORLD" "selamlar world" "sElAmlar world" 

[output with "hi"]
Hello World HELLO WORLD hello world hElLo world 
"Hi World"  "HI WORLD"  "hi world"  "hI world" 

Upvotes: 2
