Reputation: 69
I have a CSV datafile called test_20171122
Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file.
I am looking into the optimal way to clean data from an accounting format "$##,###" to a number "####" in R using gsub().
My trouble is in the iteration of gsub() across all columns of a dataset. My first instinct run gsub() on the whole dataframe (below) but it seems to alter the data in a counterproductive way.
gsub("\\$", "", test_20171122)
The following code is a for loop that seems to get the job done.
for (i in 1:length(test_20171122)){
clean1 <- gsub("\\$","",test_20171122[[1]])
clean2 <- gsub("\\,","",clean1)
test_20171122[,i] <- clean2
i = i + 1
}
I am trying to figure out the optimal way of cleaning a dataframe using gsub(). I feel like sapply() would work but it seems to break the structure of the dataframe when I run the following code:
test_20171122 <- sapply(test_20171122,function(x) gsub("\\$","",x))
test_20171122 <- sapply(test_20171122,function(x) gsub("\\,","",x))
Upvotes: 6
Views: 30015
Reputation: 565
A solution using the purrr
function map_df
:
clean_df <- map_df(test_20171122, ~ gsub("[$,]", "", .x))
Upvotes: 0
Reputation: 193657
You can use the following pattern in gsub
: "[$,]"
Example:
df <- data.frame(
V1 = c("$1,234.56", " $ 23,456.70"),
V2 = c("$89,101,124", "15,234")
)
df
# V1 V2
# 1 $1,234.56 $89,101,124
# 2 $ 23,456.70 15,234
df[] <- lapply(df, function(x) as.numeric(gsub("[$,]", "", x)))
df
# V1 V2
# 1 1234.56 89101124
# 2 23456.70 15234
Upvotes: 9