stackuser

Reputation: 133

String scan and match with respect to group in R

I am very new to R programming. I am working with some data that is collected daily from a group of people. Usually, the format of the data is:

name, DOB, HF, LGA

in text format, and the lines populate the string vector

 text_from_optin <- c()

Here, each HF is linked to a database for its LGA (there are 10 LGAs in total). That is, each LGA is a group of HFs.

Because compliance with the format is low, there are usually lots of spelling errors in the HFs.

Here is a sample of the data

 "first person Usman,03May2019,Ntade Health post,LGA1"
 "second person, 7may2019,phc,makirin, LGA2"

Here, "phc,makirin" is supposed to be spelled "Phc Makirine".

I have been able to extract the LGAs (since there are only a few) with R code that uses regular expressions covering the spelling mistakes that usually occur:

#LGA vector
library(stringr)
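# One slot per input string; entries stay NA when no LGA pattern matches
LGA <- rep(NA_character_, length(text_from_optin))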
LGA[str_detect(text_from_optin, regex("Alier|Aleiro|Alero", ignore_case = TRUE))] <- "ALIERO"
LGA[str_detect(text_from_optin, regex("Augie|Agie|Auge|Auggie?", ignore_case = TRUE))] <- "AUGIE"
LGA[str_detect(text_from_optin, regex("Bagudo", ignore_case = TRUE))] <- "BAGUDO"
LGA[str_detect(text_from_optin, regex("Bir?nin Kebb?i|BirninKebn?i|B\\Kebb?i|Binin|birninkebbi", ignore_case = TRUE))] <- "BIRNIN KEBBI"
LGA[str_detect(text_from_optin, regex("Dan?di", ignore_case = TRUE))] <- "DANDI"
LGA[str_detect(text_from_optin, regex("Danko?wasa|Wasagu|D\\Was|Dankowasagu|Danko", ignore_case = TRUE))] <- "DANKO WASAGU"
LGA[str_detect(text_from_optin, regex("Fakai", ignore_case = TRUE))] <- "FAKAI"
LGA[str_detect(text_from_optin, regex("Gw?andu", ignore_case = TRUE))] <- "GWANDU"
LGA[str_detect(text_from_optin, regex("Kalg", ignore_case = TRUE))] <- "KALGO"
LGA[str_detect(text_from_optin, regex("Koko Bes|K\\Bes|Kokobess?", ignore_case = TRUE))] <- "KOKO BESSE"

Each LGA has many HFs. Aliero, for example, has about 200 HFs, each with a standard spelling.

I am basically trying to populate the vector

Hf <- c()

with the correct standard spelling of each HF, given its LGA.

Is there syntax to say:

for each LGA group found in the text, check whether any HF in that LGA group matches and, if it does, populate the vector Hf with it?

Can someone please help me out? Thanks.

Upvotes: 1

Views: 64

Answers (1)

Chuck P

Reputation: 3923

Well, I think you have an initial problem you need to solve first. The structure of your data is bound to cause problems if there are misplaced commas. I would solve that first by carefully breaking down these strings and identifying problem input like this...

library(dplyr)
library(tidyr)

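# Read each line whole into one column; read.csv() would already split on the commas
yourdata <- data.frame(V1 = readLines("your textfile"), stringsAsFactors = FALSE)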

yourdata
#>                                                    V1
#> 1 first person Usman,03May2019,Ntade Health post,LGA1
#> 2           second person, 7may2019,phc,makirin, LGA2

newyourdata <- tidyr::separate(data = yourdata, 
                               col = V1, 
                               sep = ",", 
                               into = c("name", "DOB", "HF", "LGA", "problem"), 
                               remove = FALSE, 
                               extra = "merge", 
                               fill = "right")

newyourdata %>% filter(!is.na(problem))

#>                                          V1          name       DOB  HF     LGA
#> 1 second person, 7may2019,phc,makirin, LGA2 second person  7may2019 phc makirin
#>   problem
#> 1    LGA2
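
One possible way to repair rows like the one flagged above is sketched here. It assumes that stray commas only ever occur inside the HF field, so the first two pieces are always the name and DOB and the last piece is always the LGA. `lines` and `parse_line()` are hypothetical names used only for illustration, and only base R is used.

```r
# Sample input taken from the question
lines <- c("first person Usman,03May2019,Ntade Health post,LGA1",
           "second person, 7may2019,phc,makirin, LGA2")

# Split one line on commas and rebuild the four fields.
# Assumption: extra commas belong to the HF field only.
parse_line <- function(x) {
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  n <- length(parts)
  if (n < 4) {
    # Too few fields to repair this way; return an all-NA row
    return(data.frame(name = NA_character_, DOB = NA_character_,
                      HF = NA_character_, LGA = NA_character_,
                      stringsAsFactors = FALSE))
  }
  data.frame(
    name = parts[1],
    DOB  = parts[2],
    HF   = paste(parts[3:(n - 1)], collapse = " "),  # glue the middle pieces back together
    LGA  = parts[n],
    stringsAsFactors = FALSE
  )
}

parsed <- do.call(rbind, lapply(lines, parse_line))
parsed$HF
```

If the assumption does not hold (for example, a comma inside a name), those rows would still need manual review.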

unique(newyourdata$HF)
#> [1] "Ntade Health post" "phc"
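
Once the HF strings are isolated, the per-LGA lookup asked about in the question could be sketched with approximate string matching from base R (`utils::adist()`). This is only a sketch under assumptions: `hf_lookup`, `match_hf()`, and the 0.3 threshold are hypothetical, and the lookup entries "Phc Ntade" and "Makirine Dispensary" are made up so that each LGA has more than one candidate. In practice the lookup table would come from your HF database.

```r
# Hypothetical lookup table: one row per standard HF spelling, tagged with its LGA
hf_lookup <- data.frame(
  LGA = c("LGA1", "LGA1", "LGA2", "LGA2"),
  HF  = c("Ntade Health Post", "Phc Ntade", "Phc Makirine", "Makirine Dispensary"),
  stringsAsFactors = FALSE
)

# Return the closest standard spelling within one LGA, or NA if nothing is close enough
match_hf <- function(raw_hf, lga, lookup, max_rel_dist = 0.3) {
  candidates <- lookup$HF[lookup$LGA == lga]
  if (is.na(raw_hf) || length(candidates) == 0) return(NA_character_)
  # Generalized Levenshtein (edit) distance, ignoring case
  d <- drop(utils::adist(raw_hf, candidates, ignore.case = TRUE))
  best <- which.min(d)
  # Accept only if the distance is small relative to the candidate's length
  if (d[best] / nchar(candidates[best]) <= max_rel_dist) candidates[best] else NA_character_
}

raw_hf <- c("Ntade Health post", "phc makirin")  # HF text as typed
lga    <- c("LGA1", "LGA2")                      # LGA already extracted for each row

Hf <- mapply(match_hf, raw_hf, lga,
             MoreArgs = list(lookup = hf_lookup), USE.NAMES = FALSE)
Hf
```

Restricting the candidates to the row's LGA keeps each comparison small (about 200 names for an LGA like Aliero) and avoids matching a similarly named HF from another LGA. The threshold would need tuning against real data.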

Reproducible data

yourdata <- structure(
  list(V1 = c("first person Usman,03May2019,Ntade Health post,LGA1",
              "second person, 7may2019,phc,makirin, LGA2")),
  class = "data.frame",
  row.names = c(NA, -2L)
)

Created on 2020-05-13 by the reprex package (v0.3.0)

Upvotes: 2
