Reputation: 133
I am very new to R programming. I am working with some data that is collected daily from a group of people. Usually, the format of the data is:
name, DOB, HF, LGA
in text format which populates the string vector
text_from_optin <- c()
Here, the HF is linked to a database for each LGA (10 in total). That is, each LGA is a group of HFs.
Because compliance with the format is low, the HF names are often misspelled.
Here is a sample of the data:
"first person Usman,03May2019,Ntade Health post,LGA1"
"second person, 7may2019,phc,makirin, LGA2"
#Here, "phc,makirin" is supposed to be spelt "Phc Makirine"
I have been able to extract the LGAs (since they are few) in R, using regular expressions that cover the misspellings usually seen:
#LGA vector
library(stringr)
LGA <- c()
LGA[str_detect(text_from_optin, regex("Alier|Aleiro|Alero", ignore_case = TRUE))] <- "ALIERO"
LGA[str_detect(text_from_optin, regex("Augie|Agie|Auge|Auggie?", ignore_case = TRUE))] <- "AUGIE"
LGA[str_detect(text_from_optin, regex("Bagudo", ignore_case = TRUE))] <- "BAGUDO"
LGA[str_detect(text_from_optin, regex("Bir?nin Kebb?i|BirninKebn?i|B\\Kebb?i|Binin|birninkebbi", ignore_case = TRUE))] <- "BIRNIN KEBBI"
LGA[str_detect(text_from_optin, regex("Dan?di", ignore_case = TRUE))] <- "DANDI"
LGA[str_detect(text_from_optin, regex("Danko?wasa|Wasagu|D\\Was|Dankowasagu|Danko", ignore_case = TRUE))] <- "DANKO WASAGU"
LGA[str_detect(text_from_optin, regex("Fakai", ignore_case = TRUE))] <- "FAKAI"
LGA[str_detect(text_from_optin, regex("Gw?andu", ignore_case = TRUE))] <- "GWANDU"
LGA[str_detect(text_from_optin, regex("Kalg", ignore_case = TRUE))] <- "KALGO"
LGA[str_detect(text_from_optin, regex("Koko Bes|K\\Bes|Kokobess?", ignore_case = TRUE))] <- "KOKO BESSE"
For each LGA, Aliero for example, there are about 200 HFs, each with a standard spelling.
I am basically trying to populate the vector
Hf <- c()
with the standard spelling of the HF for the matching LGA.
Is there a syntax to say:
for each LGA group found in the text, scan whether any HF (in that LGA group) matches. If it matches, populate the vector Hf
Can someone please help me out? Thanks.
Upvotes: 1
Views: 64
Reputation: 3923
Well, I think you have an initial problem you need to solve first. The structure of your data is bound to cause problems if there are misplaced commas. I would solve that first by carefully breaking down these strings and identifying problem input like this...
library(dplyr)
library(tidyr)
yourdata <- read.csv("your textfile", header = FALSE)
yourdata
#>                                                    V1
#> 1 first person Usman,03May2019,Ntade Health post,LGA1
#> 2           second person, 7may2019,phc,makirin, LGA2
newyourdata <- tidyr::separate(data = yourdata,
                               col = V1,
                               sep = ",",
                               into = c("name", "DOB", "HF", "LGA", "problem"),
                               remove = FALSE,
                               extra = "merge",
                               fill = "right")
newyourdata %>% filter(!is.na(problem))
#>                                          V1          name      DOB  HF     LGA
#> 1 second person, 7may2019,phc,makirin, LGA2 second person 7may2019 phc makirin
#>   problem
#> 1    LGA2
unique(newyourdata$HF)
#> [1] "Ntade Health post" "phc"
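Once the problem rows are identified, one possible way to repair them is to split from both ends. This is only a sketch: it assumes that name and DOB never contain commas and that the LGA is always the last field, so any surplus commas must belong to the HF. The names `raw`, `split_record` and `repaired` are made up for illustration.

```r
# The two sample strings from the question
raw <- c("first person Usman,03May2019,Ntade Health post,LGA1",
         "second person, 7may2019,phc,makirin, LGA2")

# Assumption: fields 1 and 2 are name and DOB, the last field is the LGA,
# and everything in between is the HF (re-joined with spaces).
split_record <- function(line) {
  parts <- trimws(strsplit(line, ",", fixed = TRUE)[[1]])
  n <- length(parts)
  if (n < 4) {
    # Too few fields to interpret: return an all-NA row for manual review
    return(data.frame(name = NA_character_, DOB = NA_character_,
                      HF = NA_character_, LGA = NA_character_,
                      stringsAsFactors = FALSE))
  }
  data.frame(name = parts[1],
             DOB  = parts[2],
             HF   = paste(parts[3:(n - 1)], collapse = " "),
             LGA  = parts[n],
             stringsAsFactors = FALSE)
}

repaired <- do.call(rbind, lapply(raw, split_record))
repaired$HF
```

With the sample input this keeps "Ntade Health post" as it is and re-joins the second record's HF as "phc makirin". If a name can itself contain a comma, the assumption breaks and those rows still need checking by hand.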
Reproducible data
yourdata <- structure(
  list(V1 = c("first person Usman,03May2019,Ntade Health post,LGA1",
              "second person, 7may2019,phc,makirin, LGA2")),
  class = "data.frame",
  row.names = c(NA, -2L)
)
Created on 2020-05-13 by the reprex package (v0.3.0)
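As for the matching itself, here is a rough sketch of one way to pick the standard HF spelling within an LGA, assuming the fields have already been separated cleanly into HF and LGA columns. It swaps hand-written regexes for approximate matching with base R's `adist()` (edit distance). The lookup list `hf_lookup`, the function `match_hf`, the threshold `max_rel_dist` and the "Example Dispensary" entries are all hypothetical; the real list would come from your HF database.

```r
# Hypothetical reference list: standard HF spellings per LGA.
# The "Example Dispensary" entries are placeholders, not real facilities.
hf_lookup <- list(
  LGA1 = c("Ntade Health Post", "Example Dispensary A"),
  LGA2 = c("Phc Makirine", "Example Dispensary B")
)

# Return the standard spelling closest to `raw` within one LGA, or NA if
# nothing is close enough. `max_rel_dist` is an assumed tuning threshold.
match_hf <- function(raw, lga, lookup = hf_lookup, max_rel_dist = 0.3) {
  if (is.na(raw) || is.na(lga)) return(NA_character_)
  candidates <- lookup[[lga]]
  if (is.null(candidates)) return(NA_character_)
  # Normalise: lower case, punctuation to spaces, squeeze whitespace
  clean <- function(x) {
    trimws(gsub("\\s+", " ", gsub("[[:punct:]]", " ", tolower(x))))
  }
  d <- adist(clean(raw), clean(candidates))[1, ]
  # Edit distance relative to the longer of the two strings
  rel <- d / pmax(nchar(clean(raw)), nchar(clean(candidates)))
  best <- which.min(rel)
  if (rel[best] <= max_rel_dist) candidates[best] else NA_character_
}

# Illustrative input with HF and LGA already in their own columns
parsed <- data.frame(HF  = c("Ntade Health post", "phc,makirin"),
                     LGA = c("LGA1", "LGA2"),
                     stringsAsFactors = FALSE)

Hf <- mapply(match_hf, parsed$HF, parsed$LGA, USE.NAMES = FALSE)
Hf
```

Entries that come back as NA are the ones with no sufficiently close standard spelling, so they can be reviewed by hand. The threshold probably needs tuning against real data, since very short names such as "phc" can be close to many entries.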
Upvotes: 2