Simon
Simon

Reputation: 1111

Camel case string from all caps

I have a database with many thousands of tables and columns. The column names are consistently in all caps e.g. BOOKINGPROCESSNOTEADDED, BOOKEDDATETIME, BOOKINGSTATUS. I wish to rename columns so they are lower camel case e.g. BOOKINGSTATUS -> booking_status

Because there is no variation in case between words, spaces or underscores, it's essentially impossible to apply more traditional methods to convert strings into different cases (e.g. using R's snakecase package). I was wondering if it's possible to instead apply some sort of English language dictionary lookup on each string and return splits.

Taking the BOOKINGSTATUS example above, a return could be: boo_king_status, boo_king_stat_us and booking_status. Specifying the minimum length of words would be useful. If the minimum is set to 4 letters, then only booking_status would be returned in this example (because 'boo' is only 3-letters long and 'us' only 2-letters long)

It's quite possible that a brute-force method is too computationally expensive, but wanted to ask in case there is a reasonably efficient method to do this. A Python or R solution would be most welcome.

Upvotes: 3

Views: 593

Answers (3)

Tech Commodities
Tech Commodities

Reputation: 1959

library(dplyr)
library(stringr)

add_spaces <- function(colnames, words) {
  for(i in 1:length(colnames)) {
    for(j in words) {
      if(str_detect(string = colnames[i], pattern = j)) {
        colnames[i] <- str_replace(string = colnames[i], j, glue::glue("{str_to_lower(j)}_"))
      }
    }
  }
 
  colnames <- colnames %>% 
    str_remove("\\_+$") # Remove hyphens at the end
  
  message("Characters not identified: ")
  print(str_remove_all(colnames, "[a-z_]"))
  
  invisible(colnames)
}
   
colnames <- names(<file>) # Capture colnames  # Using c("BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS")

words <- c("BOOKING", "BOOKED", "PROCESS") # Create first list of words

colnames <- add_spaces(colnames, words) # Run the first iteration        

> Characters not identified:
[1] "NOTEADDED" "DATETIME"  "STATUS"

words <- c(words, "NOTE", "ADDED", "DATE", "TIME", "STATUS") # Augment list with missing words

colnames <- add_spaces(colnames, words) # Rerun, ... repeat as needed
    
colnames 

[1] "booking_process_note_added" "booked_date_time"           "booking_status" 

Upvotes: 2

r2evans
r2evans

Reputation: 160447

Here's a sloppy, brute-force, imperfect attempt. It will almost certainly miss something. In fact, it's more a conversation about the process, hoping that you can build a better "dictionary".

First, a discussion about this "dictionary": ideally it should contain a word and its plural, *ing, and *ed forms. We'll be attempting to replace each word with a snake-wrapped (_word_) version, so we'll go in reverse-order based on length. For sanity, we should probably remove too-short words (and, an, a), so let's start with stringr::words (simply a "sample character vectors for practicing string manipulations", not a great start).

words <- stringr::words[ order(nchar(stringr::words), decreasing = TRUE) ]
# see words[nchar(words) < 4] for what we are removing here
words <- words[nchar(words) > 3]
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words), init = vec)
# [1] "_BOOK_ING_PROCESS__NOTE_ADDED" "_BOOK_ED_DATE__TIME_"          "_BOOK_INGSTATUS"              

That looks odd, certainly. We can note that some of the words we know are in our vector are missing in stringr::words:

c("booking", "process", "status") %in% words
# [1] FALSE  TRUE FALSE

We can augment our list:

words2 <- c(words, "booking", "booked", "status")
words2 <- words2[ order(nchar(words2), decreasing = TRUE) ]
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words2), init = vec)
# [1] "__BOOK_ING__PROCESS__NOTE_ADDED" "__BOOK_ED__DATE__TIME_"          "__BOOK_ING__STATUS_"            

The issue here is that since we have both "booking" and "book", it will always double-change "BOOKING". Given my naïve start here, I don't know that there's an easy quick-patch other than to remove "book" (and "king", incidentally).

words3 <- setdiff(words2, c("book", "king"))
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words3), init = vec)
# [1] "_BOOKING__PROCESS__NOTE_ADDED" "_BOOKED__DATE__TIME_"          "_BOOKING__STATUS_"            

From here, we can remove leading/trailing and double _.

gsub("__", "_",
     gsub("^_|_$", "", 
          Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
                 toupper(words3), init = vec)))
# [1] "BOOKING_PROCESS_NOTE_ADDED" "BOOKED_DATE_TIME"           "BOOKING_STATUS"            

The quality is completely dependent on starting with a good dictionary. If all of your UPPERNOSPACEWORDS are well defined, then perhaps you can build it manually. (Note that some words may just self-isolate because there is a known word both before and after (note that "added" is not in words3 but it is still broken out).

Upvotes: 2

user2554330
user2554330

Reputation: 44887

I would build the dictionary manually:

  • Start with an empty dictionary
  • Get all names
  • Find one containing uppercase
  • Manually add words to the dictionary to split that one
  • Split all of them using the current dictionary

Repeat the last 3 steps until all words are split. For example, with the 3 names you posted, the dictionary would first get c("booking", "status"), and that name would have no uppercase. The name BOOKINGPROCESSNOTEADDED would become booking_PROCESSNOTEADDED; if you chose that, you'd add c("process", "note", "added") to the dictionary, and find BOOKEDDATETIME next. Now you need to decide on the words: is it c("booked", "date", "time") or c("booked", "datetime")?

And so on.

Upvotes: 1

Related Questions