Grace

Reputation: 201

How to extract postal code from a string of text into a new column, in R?

I have a dataframe of more than 10,000 rows. Column c contains the full address as a string, including the postal code. I would like to extract the 6-digit postal code into a new column. Every postal code comes right after the word "Singapore".

An example is as follows:

df <- data.frame(a, b, c)  # a and b are other columns; c holds the addresses

c <- c("YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180", "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902",...)

# need to extract 6-digit postal codes in c, into a new column, d

How do I extract the 6 digit postal codes into a new column, d?

Thank you!

Upvotes: 1

Views: 675

Answers (3)

TarJae

Reputation: 78907

If every address in your data follows this pattern, with the postal code at the very end, here are two more alternatives using the stringr package. Both extract only the last word of the string:

library(stringr)
word(c,-1)

str_extract(c, '\\w+$')
[1] "248180" "150902"
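To get the result into a new column d, as the question asks, something along these lines should work. This is only a sketch: the trailing spaces in the second address are made up to show that '\\w+$' needs the code to be the very last thing in the string, so the input is trimmed first with str_trim.

```r
library(stringr)

# Second address has hypothetical trailing spaces added for illustration
df <- data.frame(
  c = c("YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180",
        "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902  "),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace, then take the last run of word characters
df$d <- str_extract(str_trim(df$c), "\\w+$")
```

Rows that do not end in a word character after trimming get NA in d, which makes them easy to find afterwards.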

Upvotes: 1

Chris Ruehlemann

Reputation: 21400

Use str_extract:

library(dplyr)
library(stringr)  
df %>%
    mutate(d = str_extract(c, "\\d{6}"))
   a  b                                                                c      d
1 NA NA   YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180 248180
2 NA NA MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902 150902

The regex pattern here simply matches any run of 6 digits. If such runs can occur in your addresses without being postal codes, you can refine the pattern using the context around the codes. For example, the postal codes appear to always occur at the end of the string. That position can be targeted with the end-of-string anchor $, like so: \\d{6}$
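A small sketch of the difference, assuming an address that happens to contain another 6-digit number. The address below is invented for illustration and is not from the question's data. The lookbehind variant relies on the question's statement that the code follows the word Singapore.

```r
library(stringr)

# Hypothetical address with a 6-digit number that is not the postal code
x <- "BLK 123456 EXAMPLE ROAD, Singapore 654321"

first_run  <- str_extract(x, "\\d{6}")                 # first 6-digit run in the string
at_end     <- str_extract(x, "\\d{6}$")                # 6 digits at the end of the string
after_word <- str_extract(x, "(?<=Singapore )\\d{6}")  # 6 digits preceded by "Singapore "
```

Here first_run picks up 123456, while at_end and after_word both return 654321. Which refinement is safer depends on whether your real data ever has text after the postal code.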

Data:

  df <- data.frame(
    a = NA,
    b = NA,
    c = c("YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180", "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902")
  )

Upvotes: 4

Mossa

Reputation: 1709

Answer:

dummy <- c("YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180", "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902")
regmatches(dummy, regexpr("(\\d{6})", dummy))
[1] "248180" "150902"
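One caveat if you assign this to a column: regmatches() with regexpr() returns only the matches, so any address without a 6-digit run is dropped and the result can be shorter than the input. A possible way to keep the lengths aligned in base R, sketched with a made-up address that has no postal code:

```r
# Middle element is a hypothetical address with no postal code
x <- c("YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180",
       "CLINIC WITH NO POSTAL CODE, Singapore",
       "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902")

m <- regexpr("\\d{6}", x)            # -1 where there is no match

d <- rep(NA_character_, length(x))   # start with NA for every row
d[m != -1] <- regmatches(x, m)       # fill in only the rows that matched
```

d then has one entry per address, with NA where nothing matched, so it can be assigned directly as a new column.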

Upvotes: 3
