Josh Van Vianen

Reputation: 77

Writing a function in R that replaces a string based on the first letter that appears in it

I have a data frame with several variables like this:

land_unit<-c("0.5ha", "hactares", "ha", "ha", "acre", "3ha", 
              "lima", "limas", "acre", "cunny", "6 cunnies")

I want to write a function that will tidy this data for me, as I have many variables in my data frame with a similar format. The function should replace each element based on the first letter that appears in the string. For example, if the first letter to appear is "h", I want the whole string replaced by "ha"; if "l", then "lima"; if "a", then "acre"; and if "c", then "kani".

I have searched widely but cannot find an answer, although I suspect there is a relatively simple solution, perhaps using regex.

Any help would be greatly appreciated.

Upvotes: 2

Views: 51

Answers (2)

Sandipan Dey

Reputation: 23101

This should also work. It keeps the replacements in a hard-coded lookup table, which decouples the data from the code:

land_unit<-c("0.5ha", "hactares", "ha", "ha", "acre", "3ha", 
             "lima", "limas", "acre", "cunny", "6 cunnies")

library(stringr)
# define a lookup table, decouple the data
lookup_table <- data.frame(first.letter=c('h', 'l', 'a', 'c'), 
                           replace.str=c('ha', 'lima', 'acre', 'kani'), 
                           stringsAsFactors = FALSE) 
# extract the matches
matches <- match(str_match(land_unit, "[^[:alpha:]]*([[:alpha:]]).*")[,2] , lookup_table[,1]) 
# replace from lookup table
ifelse(!is.na(matches), lookup_table[matches,2], land_unit) 
# [1] "ha"   "ha"   "ha"   "ha"   "acre" "ha"   "lima" "lima" "acre" "kani" "kani"
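
Since the question asks for a reusable function, the same lookup idea can be wrapped up with base R only. This is a minimal sketch, not part of the answer above: the name tidy_units, the data frame df, and the column plot_unit are hypothetical, and lower-casing the first letter is an added assumption so that "Ha" is treated like "ha".

```r
# Hypothetical helper: replace each string by a unit chosen from its first letter.
tidy_units <- function(x, lookup = c(h = "ha", l = "lima", a = "acre", c = "kani")) {
  x <- as.character(x)
  # Position of the first letter in each string (-1 if there is none)
  pos <- regexpr("[[:alpha:]]", x)
  # Assumption: lower-case the letter so capitalised entries are matched too
  first_letter <- tolower(substr(x, pos, pos))
  # Look the letter up by name; letters without an entry give NA
  replacement <- unname(lookup[first_letter])
  # Keep the original value where no replacement applies
  ifelse(is.na(replacement), x, replacement)
}

# Hypothetical data frame with two columns in a similar format
df <- data.frame(land_unit = c("0.5ha", "limas", "6 cunnies"),
                 plot_unit = c("acre", "3ha", "cunny"),
                 stringsAsFactors = FALSE)
cols <- c("land_unit", "plot_unit")
df[cols] <- lapply(df[cols], tidy_units)
```

Passing the lookup as an argument means other variables can use a different set of replacements without changing the function.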

Upvotes: 1

akrun

Reputation: 887028

Based on the description, maybe this helps. We use gsubfn to match zero or more characters that are not a letter ([^A-Za-z]*) from the start of the string (^), followed by a single lowercase letter captured as a group (([a-z])), followed by any other characters (.*). The captured letter is then looked up in a named key/value list, and the match is replaced by the corresponding value:

library(gsubfn)
gsubfn("^[^A-Za-z]*([a-z]).*", list(h = "ha", l="lima", a = "acre", c = "kani"), land_unit)
#[1] "ha"   "ha"   "ha"   "ha"   "acre" "ha"   "lima" "lima" "acre" "kani" "kani"
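
To see what the capture group picks out, the same pattern can be tried with base R's sub, keeping only the captured letter. This is an illustrative sketch rather than part of the gsubfn solution; the names first_letter and units are assumptions introduced here.

```r
land_unit <- c("0.5ha", "hactares", "ha", "ha", "acre", "3ha",
               "lima", "limas", "acre", "cunny", "6 cunnies")

# Same pattern as above, but replace the whole match with the captured letter (\\1)
first_letter <- sub("^[^A-Za-z]*([a-z]).*", "\\1", land_unit)
first_letter
# [1] "h" "h" "h" "h" "a" "h" "l" "l" "a" "c" "c"

# A named character vector plays the role of the key/value list
units <- c(h = "ha", l = "lima", a = "acre", c = "kani")
unname(units[first_letter])
# [1] "ha"   "ha"   "ha"   "ha"   "acre" "ha"   "lima" "lima" "acre" "kani" "kani"
```

Note that a string the pattern does not match (for example one whose first letter is uppercase) is returned unchanged by sub, so the lookup gives NA for it.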

Upvotes: 1
