Reputation: 402
I am working on a dataframe df
that has thousands of rows of junk data in which the first number is to be extracted despite irregular delimiter:
dummy_numbers = c("70210813000 70210862354",
"(234) 641-9690", "ren 23465726375 Finney",
"234 0225 7583 ALT 031 026 66542", "omega",
"(034) 319-6427", "(034)3263802", "(034)3128548",
"Mat: 030791272113 / 03040752983")
df <- data.frame(dummy_numbers)
> df
dummy_numbers
1 70210813000 70210862354
2 (234) 641-9690
3 ren 23465726375 Finney
4 234 0225 7583 ALT 031 026 66542
5 omega
6 (034) 319-6427
7 (034)3263802
8 (034)3128548
9 Mat: 030791272113 / 03040752983
Expected result is:
> df
dummy_numbers
1 70210813000
2 2346419690
3 23465726375
4 23402257583
5 NA
6 0343196427
7 0343263802
8 0343128548
9 030791272113
The idea of removing letters, punctuation, spaces, dash and all non-numeric did not yield desired result. I think it is due to inability to deal with irregular delimiters.
> df %>% dplyr::mutate(dummy_numbers = gsub("[- ./)(+]|[a-zA-Z]*:?","", dummy_numbers))
dummy_numbers
1 7021081300070210862354
2 2346419690
3 23465726375
4 2340225758303102666542
5
6 0343196427
7 0343263802
8 0343128548
9 03079127211303040752983
The idea of employing str_first_number()
function from the strex
package did not also yield desired result.
library(strex)
> df %>% dplyr::mutate(dummy_numbers = str_first_number(dummy_numbers))
dummy_numbers
1 70210813000
2 234
3 23465726375
4 234
5 NA
6 34
7 34
8 34
9 30791272113
Any help will be appreciated.
Upvotes: 0
Views: 56
Reputation: 19191
You can split the problem in two, based on the examples you have given.
First work on the strings with no alpha characters and no brackets or minus.
Then do the others. Not sure though if this holds true for all values you have in your data set. You may have to add more rules and use case_when
or similar.
library(stringr)
data.frame(dummy_numbers=
ifelse(!grepl("[[:alpha:]()-]", df$dummy_numbers),
str_extract(df$dummy_numbers, "[[:digit:]]+"),
trimws(str_extract(gsub("[ ()-]","",
df$dummy_numbers), "[[:digit:]]+"))))
dummy_numbers
1 70210813000
2 2346419690
3 23465726375
4 23402257583
5 <NA>
6 0343196427
7 0343263802
8 0343128548
9 030791272113
Upvotes: 3