myloginid

Reputation: 1473

R - Regex to Remove Last Word from String

I have data like this:

PLAYSTORE BANGKOK
FLOAT@THE BAY          SINGAPORE
YANTRA                 SINGAPORE
AIRASIA_QS9DQQL        SINGAPORE

I want to remove the last word from each string if it is in the list of cities I am looking for. I am using this:

sub('(?i)^(.*)\\b(singapore|stockholm|singapor|bangkok|kuala lumpur|london|tokyo)$','\\2', merch_desc$desc2 )

But neither \1 nor \2 works, and I get the full string back. Is there a way to correct this?

I want two outputs: one with the company names and one with the locations, as two separate vectors.

merch_desc$merch:

PLAYSTORE
FLOAT@THE BAY
YANTRA
AIRASIA_QS9DQQL

merch_desc$loc:

BANGKOK
SINGAPORE
SINGAPORE
SINGAPORE

It seems strange that it works on a single string but not on a data frame column:

test$desc2
[1] "qoo10                  singapore    " "bill payment via internet banking"    "mcdonald's restaurants singapore    "
[4] "hdb season parking     singapore    " "grabtaxi pte ltd       singapore    "

This does not work:

sub('^.* (singapore|stockholm|singapor|bangkok|kuala lumpur|london|tokyo)$', '\\1', test$desc2 )
[1] "qoo10                  singapore    " "bill payment via internet banking"    "mcdonald's restaurants singapore    "
[4] "hdb season parking     singapore    " "grabtaxi pte ltd       singapore    "

But this works:

sub('^.* (singapore|stockholm|singapor|bangkok|kuala lumpur|london|tokyo)$', '\\1', 'tigerair y843km singapore' )
[1] "singapore"

Edit 2:

Use trimws(). Without trimws(), the trailing spaces in the data frame strings keep the pattern, which is anchored with $, from matching.
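
A minimal sketch of that fix, assuming the values look like the test$desc2 output above. The vector desc2 and the helper names cities and pattern are stand-ins introduced here for illustration:

```r
# Hypothetical stand-in for test$desc2; note the trailing spaces.
desc2 <- c("qoo10                  singapore    ",
           "bill payment via internet banking",
           "mcdonald's restaurants singapore    ")

cities  <- "singapore|stockholm|singapor|bangkok|kuala lumpur|london|tokyo"
pattern <- paste0("^.* (", cities, ")$")

# Trailing spaces sit between the city and the end of the string,
# so the $ anchor fails and sub() returns the input unchanged.
untrimmed <- sub(pattern, "\\1", desc2)

# After trimws() the city is the last thing in the string, so the
# pattern matches; strings without a listed city are left as they are.
trimmed <- sub(pattern, "\\1", trimws(desc2))
print(trimmed)
```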

Thanks, Manish

Upvotes: 3

Views: 1408

Answers (1)

akrun

Reputation: 887951

We can capture the two parts of each string as groups in the sub pattern, insert a delimiter (,) between the capture groups in the replacement, and then pass that delimiter as sep to read.table. If there are leading/trailing spaces, remove them with str_trim from stringr by looping over the columns.

library(stringr)
d1 <- read.table(text=sub('(.*)\\s+(\\S+)$', '\\1,\\2', v1),sep=',')
d1[] <- lapply(d1, str_trim)
d1
#              V1        V2
#1       PLAYSTORE   BANGKOK
#2   FLOAT@THE BAY SINGAPORE
#3          YANTRA SINGAPORE
#4 AIRASIA_QS9DQQL SINGAPORE

Or, as suggested by @RichardScriven, a base R option for trimming leading/trailing spaces is trimws.

d1[] <- lapply(d1, trimws)

data

v1 <- c('PLAYSTORE BANGKOK',
        'FLOAT@THE BAY          SINGAPORE',
        'YANTRA                 SINGAPORE',
        'AIRASIA_QS9DQQL        SINGAPORE')
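
The approach above splits on the last run of whitespace regardless of what the last word is. If the split should happen only when the string ends in one of the listed cities, as the question asks, one possible base R sketch follows. It is an illustration rather than part of the original answer: the names cities, pattern, has_city, merch and loc are introduced here, and it assumes perl = TRUE so that the lazy quantifier behaves as described.

```r
v1 <- c('PLAYSTORE BANGKOK',
        'FLOAT@THE BAY          SINGAPORE',
        'YANTRA                 SINGAPORE',
        'AIRASIA_QS9DQQL        SINGAPORE')

# City list taken from the question.
cities <- c("singapore", "stockholm", "singapor", "bangkok",
            "kuala lumpur", "london", "tokyo")

# Group 1: everything before the city (lazy, so it does not swallow the
# whitespace run); group 2: a listed city at the very end of the string.
pattern <- paste0("^(.*?)\\s+(", paste(cities, collapse = "|"), ")$")

x <- trimws(v1)  # drop leading/trailing spaces so $ can match
has_city <- grepl(pattern, x, ignore.case = TRUE, perl = TRUE)

# sub() leaves non-matching strings unchanged, so merch keeps the full
# text when no listed city is found; loc is NA in that case.
merch <- sub(pattern, "\\1", x, ignore.case = TRUE, perl = TRUE)
loc   <- ifelse(has_city,
                sub(pattern, "\\2", x, ignore.case = TRUE, perl = TRUE),
                NA_character_)

print(data.frame(merch, loc))
```

Because the city list is matched as a whole alternative, a two-word entry such as "kuala lumpur" stays together in loc instead of being cut at its last space.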

Upvotes: 3
