Reputation: 323
I am relatively new to R and having difficulty cleaning up a data frame using regex.
One of the columns of that data frame has strings such as:
NUMERO_APPEL
1 NNA
2 VQ-40989
3 41993
4 41993
5 42597
6 VQ-42597
7 DER8
8 40001-2010
I would like to extract the 5 consecutive digits of the strings that have the following format and only the following format, all other strings will be replaced by NAs.
AO-11111
VQ-11111
11111
** Even if Case 8 contains 5 consecutive numbers, it will be replaced by NA as well... Furthermore, a more than or less than 5 digits long number would also be replaced by NA.
Note that the 5 consecutive digits could be any number [0-9], but the characters 'AO-' and 'VQ-' are fixed (i.e. 'AO ' or 'VE-' would be replaced to NA as well.)
This is the code that I currently have:
# Declare a Function that Extracts the 1st 'n' Characters Starting from the Right!
RightSubstring <- function(String, n) {
substr(String, nchar(String)-n+1, nchar(String))
}
# Declare Function to Remove NAs in Specific Columns!
ColRemNAs <- function(DataFrame, Column) {
CompleteVector <- complete.cases(DataFrame, Column)
return(DataFrame[CompleteVector, ])
Contrat$NUMERO_APPEL <- RightSubstring(as.character(Contrat$NUMERO_APPEL), 5)
Contrat$NUMERO_APPEL <- gsub("[^0-9]", NA, Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL <- as.numeric(Contrat$NUMERO_APPEL)
# Efface les Lignes avec des éléments NAs.
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_COMMANDE)
Contrat <- ColRemNAs(Contrat, Contrat$NO_FOURNISSEUR)
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_APPEL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_INITIAL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_ACTUEL)
}
Thanks in advance. Hope my explanations were clear!
Upvotes: 1
Views: 287
Reputation: 521194
Here is a base R solution which will match 5 digits occurring only in the following three forms:
AO-11111
VQ-11111
11111
I use this regular expression to match the five digits:
^((AQ|VQ)-)?(\\d{5})$
Strings which match begin with an optional AQ-
or VQ-
, and then are followed by 5 consecutive digits, after which the string must terminate.
The following code substitutes all matching patterns with the 5 digits found, and stores NA
into all non-matching patterns.
ind <- grep("^((AQ|VQ)-)?(\\d{5})$", Contrat$NUMERO_APPEL, value = FALSE)
Contrat$NUMERO_APPEL <- gsub("^(((AQ|VQ)-)?(\\d{5}))$", "\\4", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL[-ind] <- NA
For more reading see this SO post.
Upvotes: 3
Reputation: 4024
library(dplyr)
library(stringi)
df %>%
mutate(NUMERO_APPEL.fix =
NUMERO_APPEL %>%
stri_extract_first_regex("[0-9]{5}") %>%
as.numeric)
Upvotes: 2