Xavier
Xavier

Reputation: 323

Difficulty Cleaning a Data Frame Using Regexp in R

I am relatively new to R and having difficulty cleaning up a data frame using regex.

One of the columns of that data frame has strings such as:

  NUMERO_APPEL
1          NNA
2     VQ-40989
3        41993
4        41993
5        42597
6     VQ-42597
7         DER8
8   40001-2010

I would like to extract the 5 consecutive digits of the strings that have the following format and only the following format, all other strings will be replaced by NAs.

AO-11111
VQ-11111
11111

** Even if Case 8 contains 5 consecutive numbers, it will be replaced by NA as well... Furthermore, a more than or less than 5 digits long number would also be replaced by NA.

Note that the 5 consecutive digits could be any number [0-9], but the characters 'AO-' and 'VQ-' are fixed (i.e. 'AO ' or 'VE-' would be replaced to NA as well.)

This is the code that I currently have:

# Declare a Function that Extracts the 1st 'n' Characters Starting from the Right!
 RightSubstring <- function(String, n) {
     substr(String, nchar(String)-n+1, nchar(String))
 }

# Declare Function to Remove NAs in Specific Columns!
ColRemNAs <- function(DataFrame, Column) {
    CompleteVector <- complete.cases(DataFrame, Column)
    return(DataFrame[CompleteVector, ])

    Contrat$NUMERO_APPEL <- RightSubstring(as.character(Contrat$NUMERO_APPEL), 5)
    Contrat$NUMERO_APPEL <- gsub("[^0-9]", NA, Contrat$NUMERO_APPEL)
    Contrat$NUMERO_APPEL <- as.numeric(Contrat$NUMERO_APPEL)

    # Efface les Lignes avec des éléments NAs.
    Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_COMMANDE)
    Contrat <- ColRemNAs(Contrat, Contrat$NO_FOURNISSEUR)
    Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_APPEL)
    Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_INITIAL)
    Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_ACTUEL)
}

Thanks in advance. Hope my explanations were clear!

Upvotes: 1

Views: 287

Answers (2)

Tim Biegeleisen
Tim Biegeleisen

Reputation: 521194

Here is a base R solution which will match 5 digits occurring only in the following three forms:

AO-11111
VQ-11111
11111

I use this regular expression to match the five digits:

^((AQ|VQ)-)?(\\d{5})$

Strings which match begin with an optional AQ- or VQ-, and then are followed by 5 consecutive digits, after which the string must terminate.

The following code substitutes all matching patterns with the 5 digits found, and stores NA into all non-matching patterns.

ind <- grep("^((AQ|VQ)-)?(\\d{5})$", Contrat$NUMERO_APPEL, value = FALSE)
Contrat$NUMERO_APPEL <- gsub("^(((AQ|VQ)-)?(\\d{5}))$", "\\4", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL[-ind] <- NA

For more reading see this SO post.

Upvotes: 3

bramtayl
bramtayl

Reputation: 4024

library(dplyr)
library(stringi)

df %>%
  mutate(NUMERO_APPEL.fix = 
           NUMERO_APPEL %>% 
             stri_extract_first_regex("[0-9]{5}") %>%
             as.numeric)

Upvotes: 2

Related Questions