Reputation: 735
I've gotten hold of some really messy data and I wrote a function to do some conversions (string to numeric), and I would love to improve it. Basically the function takes a vector of messy character data and converts the data to numeric.
For example:
## say you had this
df1 <- data.frame ( V1 = c(" $25.25", "4,828", " $7,253"), V2 = c( "THIS is bad data", "725", "*error"))
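library(stringr)  # provides str_trim()
numconv <- function(vec){
  vec <- str_trim(vec)
  vec <- gsub(",|\\$", "", vec)
  # convert only if every element contains at least one digit
  if( sum(!grepl( "[0-9]",vec)) == 0){
    vec <- as.numeric(vec)
  }
  if( sum(!grepl( "[0-9]",vec)) != 0){
    print("!!ERROR STRANGE CHARACTERS!!")
  }
  vec  # return the vector so the assignments below receive a value
}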
df1$V1recode <- numconv(df1$V1)
df1$V2recode <- numconv(df1$V2)
[1] "!!ERROR STRANGE CHARACTERS!!"
How can I get the name of the original column inside the function, so that I can paste it into the error message and have it read instead:
!!ERROR STRANGE CHARACTER IN V2!!
I've tried calling names() and colnames() within the function, but this doesn't seem to work.
Thanks in advance, C
Upvotes: 1
Views: 151
Reputation: 263471
The old deparse(substitute(.)) trick seems to work.
numconv <- function(vec){
  nam <- deparse(substitute(vec))  # text of the expression passed in, e.g. "df1$V2"
  vec <- gsub(" ","", vec)
  vec <- gsub(",|\\$", "", vec)
  if( sum(!grepl( "[0-9]",vec)) == 0){
    vec <- as.numeric(vec)
  }
  if( sum(!grepl( "[0-9]",vec)) != 0){
    print(paste("!!ERROR STRANGE CHARACTERS!!", nam) )
  }
  vec  # return the (possibly converted) vector
}
df1$V2recode <- numconv(df1$V2)
# [1] "!!ERROR STRANGE CHARACTERS!! df1$V2"
(I didn't load stringr since I thought a gsub call would be more efficient.)
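To see what the trick captures on its own, here is a minimal sketch. The helper name arg_label is made up for illustration. It also shows a caveat worth knowing: the label is the text of the expression at the call site, so calling through lapply() typically yields lapply's internal expression rather than a column name.

```r
# Hypothetical helper: returns the text of whatever expression was passed as x.
arg_label <- function(x) deparse(substitute(x))

df1 <- data.frame(V1 = c(" $25.25", "4,828"), V2 = c("725", "*error"))

# Called directly, the label is the expression as typed.
direct <- arg_label(df1$V2)

# Called through lapply(), the label is whatever expression lapply() used
# internally to pass each column, not "V1" or "V2".
via_lapply <- unlist(lapply(df1, arg_label))

print(direct)
print(via_lapply)
```

So this approach fits best when the function is always called directly with df1$V2 style arguments.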
Upvotes: 2
Reputation: 3601
The key is to wrap the recoding up into the function as well. That way you can keep track of which columns you're working on, and so get the column names to put in your warning message. The following function recodes whichever columns of a data frame are listed in the 'col_names' argument (if left NULL, the function applies to all of them). It returns the original data frame plus the recoded columns, with the string in 'flag' appended to their column names.
require(stringr)
df1 <- data.frame (
V1 = c(" $25.25", "4,828", " $7,253"),
V2 = c( "THIS is bad data", "725", "*error"))
numconv <- function(df, col_names = NULL, flag = "recode"){
if(is.null(col_names)) {
col_names <- colnames(df)
}
out <- lapply(seq_along(col_names), function(i) {
vec <- str_trim(df[,col_names[i]])
vec <- gsub(",|\\$", "", vec)
if( sum(!grepl( "[0-9]",vec)) == 0){
vec <- as.numeric(vec)
}
if( sum(!grepl( "[0-9]",vec)) != 0){
print(paste("!!ERROR STRANGE CHARACTERS in", col_names[i], "!!"))
}
vec
})
out <- data.frame(out, stringsAsFactors = FALSE)
colnames(out) <- paste(col_names, flag, sep = "")
cbind(df, out)
}
numconv(df1)
[1] "!!ERROR STRANGE CHARACTERS in V2 !!"
V1 V2 V1recode V2recode
1 $25.25 THIS is bad data 25.25 THIS is bad data
2 4,828 725 4828.00 725
3 $7,253 *error 7253.00 *error
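The same idea, that iterating over the data frame itself keeps the column names in reach, can be sketched more compactly. This is only an illustration under assumptions: find_bad_columns is a made-up name, and it swaps in a different check from the one above (a column is flagged when as.numeric() yields NA for any cleaned value, rather than when a value contains no digit).

```r
# Hypothetical helper: names of the columns that still hold non-numeric
# text after stripping whitespace, commas and dollar signs.
find_bad_columns <- function(df) {
  is_bad <- vapply(df, function(col) {
    cleaned <- gsub(",|\\$", "", trimws(as.character(col)))
    # as.numeric() gives NA (with a warning) for text it cannot parse
    any(is.na(suppressWarnings(as.numeric(cleaned))))
  }, logical(1))
  names(df)[is_bad]
}

df1 <- data.frame(
  V1 = c(" $25.25", "4,828", " $7,253"),
  V2 = c("THIS is bad data", "725", "*error"))

bad <- find_bad_columns(df1)
print(bad)
```

The returned names can then be pasted into whatever message you want to show.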
Upvotes: 1
Reputation: 60000
I feel this is a somewhat hacky way to do this, but you could use substitute and then strsplit on the $. This assumes you always refer to a column by its name with $. Anyway, you can get the column name this way and paste it into an error message as you wish...
x <- strsplit(as.character( substitute(vec) ) ,"$" )[[3]]
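As a sketch of the same idea with the `$` assumption made explicit: the captured expression for df1$V2 is a call with three parts (`$`, df1, V2), so the third part can be picked out directly, with a fallback to the full expression text for any other calling style. The helper name col_label is invented for this example.

```r
# Hypothetical helper: extracts the column name when called as f(df$col),
# otherwise falls back to the full text of the expression.
col_label <- function(vec) {
  expr <- substitute(vec)
  if (is.call(expr) && identical(expr[[1]], as.name("$"))) {
    as.character(expr[[3]])           # the part after the $
  } else {
    paste(deparse(expr), collapse = "")
  }
}

df1 <- data.frame(V1 = c("1", "2"), V2 = c("a", "b"))

from_dollar  <- col_label(df1$V2)
from_bracket <- col_label(df1[["V2"]])

print(from_dollar)
print(from_bracket)
```

Inside numconv the same lines would go at the top of the function body, before vec is modified.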
Upvotes: 1