Reputation: 271

Regular expressions to re-order strings in a field

I am trying to write a program that uses regular expressions to clean up some data. Say I have room names that may contain a letter and a number. In the final output, each room name should follow the pattern "the full string (excluding letter & number) + letter + number", as in the examples below. However, the regular expressions I've written so far give very messed-up results, shown at the bottom of my message. For some reason, letters and numbers get added to some rows even when there are none in the input data. Thank you.

EDITED: I made edits to the input data. I would like to generalize the code to take any number of character strings, not just the single word "ROOM".

# the pattern should be "the full string (excluding letter & number) + letter + number". For example:
ATLANTA ROOM
ATLANTA ROOM 3
NEW YORK ROOM A 2
ROOM A 4
THE BIG AWESOME ROOM B
ROOM B 4
GEORGETOWN ROOM B 2
NEW YORK ROOM C 2
NEW YORK ROOM C
LOS ANGELES ROOM E 2

# program to clean with regular expressions. there could be multiple spaces between words
dd <- c("ATLANTA ROOM ",
    " ATLANTA ROOM  3",
    "NEW YORK A ROOM   2",
    "4 ROOM A",
    "THE BIG AWESOME ROOM B",
    " ROOM 4 B",
    "GEORGETOWN B 2 ROOM ",
    " C NEW YORK ROOM 2",
    "NEW YORK ROOM C",
    "LOS ANGELES ROOM 2  E")

m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)

(dd2 <- paste(gsub("( +)", " ",
                   gsub("(^ +)|( +$)", "",
                        gsub("(\\<A|B|C|D|E|1|2|3|4\\>)", "", dd))),
              regmatches(dd, m_char), regmatches(dd, m_num), sep = " "))

# actual output from the program
"TLANTA ROOMA3",
"TLANTA ROOMA2",
"NW YORK ROOMA4",
"ROOMA4", 
"TH IG WSOM ROOME2",
"ROOMB2",
"GORGTOWN ROOMB2",
"NW YORK ROOMC3", 
"NW YORK ROOMC2",
"LOS NGLS ROOMA4"

Upvotes: 4

Answers (3)

eddi

Reputation: 49448

Here's an attempt:

sub(' $', '', # clean up spaces at the end
    gsub(' +', ' ', # clean up double spaces
         # rearrange letter and numbers
         sub('^([A-Z]?)([0-9]*)([A-Z]?)$', 'ROOM \\1\\3 \\2',
             gsub(' |ROOM', '', dd)    # remove spaces and ROOM
            )
        )
   )
#[1] "ROOM"     "ROOM 3"   "ROOM A 2" "ROOM A 4" "ROOM B"   "ROOM B 4" "ROOM B 2"
#[8] "ROOM C 2" "ROOM C"   "ROOM E 2"

And here's the same logic for the edited OP and comment below (assuming room names are words that have at least 3 letters and at most a 2-letter room designation):

gsub('(^ | $)', '', # clean up spaces in front or end
     gsub(' +', ' ', # clean up double spaces
          # extract room name and put it in front of the letter and number
          paste(gsub('\\b([A-Z][A-Z]?|[0-9]+)\\b', '', dd, perl = TRUE),
                sub('^([A-Z]+)?([0-9]*)([A-Z]+)?$', '\\1\\3 \\2',
                    gsub(' |\\w\\w\\w+', '', dd)    # remove spaces and words
                   )
               )
         )
    )
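
To see how the nested calls above fit together, the same steps can be unrolled on a single input. This is only a sketch: the variable names (x, name, code, result) are illustrative and not part of the answer, and the same assumptions apply (room-name words have at least 3 letters, the designation at most 2).

```r
x <- "GEORGETOWN B 2 ROOM "   # one element of dd, used for illustration

# 1. Drop standalone 1-2 letter words and numbers, leaving the room name.
name <- gsub("\\b([A-Z][A-Z]?|[0-9]+)\\b", "", x, perl = TRUE)

# 2. Drop spaces and words of 3+ characters, leaving letter and number glued together.
code <- gsub(" |\\w\\w\\w+", "", x)

# 3. Reorder so the letter comes first and the number second.
code <- sub("^([A-Z]+)?([0-9]*)([A-Z]+)?$", "\\1\\3 \\2", code)

# 4. Join the pieces, collapse runs of spaces, and trim the ends.
result <- gsub("(^ | $)", "", gsub(" +", " ", paste(name, code)))
result
# [1] "GEORGETOWN ROOM B 2"
```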

Upvotes: 4

G. Grothendieck

Reputation: 269664

Try this:

library(gsubfn)

# extract numbers (num) and room letters (char)
num <- sapply(strapplyc(dd, "\\d|$"), paste, collapse = "")
char <- sapply(strapplyc(dd, "[A-D]|$"), paste, collapse = "")

# put back together and sort
out <- sort(paste("ROOM", char, num))

# trim spaces (optional)
out <- gsub(" +", " ", sub(" *$", "", out))

> out
 [1] "ROOM"     "ROOM 2"   "ROOM 3"   "ROOM A 2" "ROOM A 4" "ROOM B"  
 [7] "ROOM B 2" "ROOM B 4" "ROOM C"   "ROOM C 2"

UPDATE: minor improvements
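
For readers who would rather not depend on gsubfn, roughly the same extraction can be sketched in base R with gregexpr() and regmatches(). This is a swapped-in technique, not the code above: the helper extract_all and the input demo are hypothetical, and the letter pattern uses word boundaries so that only standalone letters A-E count.

```r
demo <- c("ROOM", "ROOM 3", "A ROOM 2", "ROOM 4 B")   # hypothetical input

# Collect every match of `pattern` in each string and glue them together;
# strings with no match yield "" because paste(character(0), collapse = "") is "".
extract_all <- function(pattern, x) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(hits, function(h) paste(h, collapse = ""), character(1))
}

num  <- extract_all("[0-9]", demo)
char <- extract_all("\\b[A-E]\\b", demo)

# Reassemble, then collapse repeated spaces and trim the ends.
out <- trimws(gsub(" +", " ", paste("ROOM", char, num)))
out
# [1] "ROOM"     "ROOM 3"   "ROOM A 2" "ROOM B 4"
```

Unlike the version above, this sketch does not sort the result, so each output stays aligned with its input row.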

Upvotes: 0

Hillary Sanders

Reputation: 6047

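What's happening is that regmatches() returns only the matches it finds. If, e.g., your program finds letters in only 8 of the 10 strings, you get 8 values back, and instead of inserting "" or NA for the missing rows, paste() recycles those 8 values to fill them.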
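
A small illustration of the recycling, assuming a three-element input (dd_demo and the other names here are made up for the example):

```r
dd_demo <- c("ROOM", "ROOM A", "ROOM B 2")   # hypothetical mini input

m <- regexpr("\\b[A-E]\\b", dd_demo, perl = TRUE)
found <- regmatches(dd_demo, m)   # only the matches: one value per matching row
length(found)                     # 2, although dd_demo has 3 elements

# paste() silently recycles the shorter vector, so rows get misaligned.
misaligned <- paste(dd_demo, found)

# Pre-filling with "" and assigning only where a match exists keeps rows aligned.
aligned <- rep("", length(dd_demo))
aligned[m > 0] <- found
```

Here the first row, which has no letter at all, ends up with the "A" that belongs to the second row.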

Here is a fix:

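# Group the alternatives so the word boundaries apply to every option,
# not just the first and last one.
m_char <- regexpr("\\<(A|B|C|D|E)\\>", dd)
m_num <- regexpr("\\<(1|2|3|4)\\>", dd)

# Start with empty strings, then fill in only the rows that matched.
numbers <- rep("", length(dd))
numbers[m_num > 0] <- regmatches(dd, m_num)

room_letters <- rep("", length(dd))   # avoids masking the built-in `letters`
room_letters[m_char > 0] <- regmatches(dd, m_char)

# trimws() is base R; gsub() collapses the double space left by an empty letter.
output <- trimws(gsub(" +", " ", paste("ROOM", room_letters, numbers)))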

[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2" "ROOM C 2" "ROOM C"
[10] "ROOM E 2"

Upvotes: 2
