Reputation: 750
I have been instructed to group a dataset of sales by salesperson name for work.
However, to reduce IT administration costs, the company does not create a separate ID or login for each salesperson, instead they have asked them to enter it manually in an app they are using. As you can imagine, this results in lots of duplicates due to data entry errors.
The dataset looks something like this (not their real names):
reprex_days <- seq(1,22,1)
reprex_sales <- rnorm(22, mean=100, sd=50)
reprex_names<- c("Abby Davies", "Abby Davies", "Abby Davies", "Abby Davis", "Abi Davies", "abby davies", NA,
"simon", "Simon Jenkins", "Simon Jenkins", "Simon Jenkins", "Simon Jenkins", "Simon jenjins",
"Toby Jones", "Toby Jones", "toby jones", "toby jones", "toby jonse", "toby", NA, NA, NA)
reprex <- data.frame(cbind(reprex_days, reprex_sales, reprex_names))
names(reprex) <-c("day_ID", "sales", "salesperson")
reprex$sales <- as.numeric(reprex$sales)
If I try to group by salesperson this obviously duplicates because of the multiple versions of each salesperson's name present in the data:
reprex_grouped <- reprex %>%
group_by(salesperson) %>%
summarise(total_sales = sum(sales),
mean_sales = mean(sales),
days_active = n())
reprex_grouped
gives:
# A tibble: 13 x 4
salesperson total_sales mean_sales days_active
<fct> <dbl> <dbl> <int>
1 Abby Davis 12 12 1
2 abby davies 14 14 1
3 Abby Davies 31 10.3 3
4 Abi Davies 3 3 1
5 simon 20 20 1
6 Simon Jenkins 5 5 1
7 Simon jenjins 13 13 1
8 Simon Jenkins 46 15.3 3
9 toby 18 18 1
10 toby jones 25 12.5 2
11 Toby Jones 17 8.5 2
12 toby jonse 6 6 1
13 NA 43 10.8 4
I'm planning to use str_remove() and toLower() to remove all the extra spaces and account for case differences, and I know that I'm not ever going to be able to successfully identify staff members when they have only entered their first names (sometimes there are multiple people with the same first name) or have failed to enter anything.
However I'm wondering if there's some way of automatically identifying where the name is one or two characters different from another entry and changing it to whichever comes first in the dataset (for this purpose it doesn't matter whether their name is spelled correctly as much as that as many as possible of their sales are attributed to a single person)? (e.g. "Toby Jonse" c.f. "Toby Jones", "Simon Jenjins" c.f. "Simon Jenkins", "Abby Davis" c.f. "Abby Davies")
Does anyone know of an advanced string pattern recognition package that can do something like this?
I can't do it manually because there are too many names, and because the company want to reuse the code for future sales activities with different staff. (I can't change the identification system; IT don't have the budget to hire someone to manage database permissions so they can't implement a traditional login system.)
Upvotes: 1
Views: 202
Reputation: 40171
It's not perfect, however, a nice option could be the phonetic()
function from stringdist
library:
reprex %>%
group_by(ID = phonetic(salesperson)) %>%
mutate(salesperson2 = first(salesperson))
day_ID sales salesperson ID salesperson2
<fct> <dbl> <fct> <chr> <fct>
1 1 10 Abby Davies A131 Abby Davies
2 2 13 Abby Davies A131 Abby Davies
3 3 14 Abby Davies A131 Abby Davies
4 4 19 Abby Davis A131 Abby Davies
5 5 22 Abi Davies A131 Abby Davies
6 6 7 abby davies A131 Abby Davies
7 7 18 <NA> <NA> <NA>
8 8 2 simon S550 simon
9 9 1 Simon Jenkins S552 Simon Jenkins
10 10 11 Simon Jenkins S552 Simon Jenkins
Upvotes: 1