Brandon
Brandon

Reputation: 69

Using gsub() on a dataframe

I have a CSV datafile called test_20171122

image of dataset

Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file.

I am looking into the optimal way to clean data from an accounting format "$##,###" to a number "####" in R using gsub().

My trouble is in the iteration of gsub() across all columns of a dataset. My first instinct run gsub() on the whole dataframe (below) but it seems to alter the data in a counterproductive way.

gsub("\\$", "", test_20171122)

The following code is a for loop that seems to get the job done.

for (i in 1:length(test_20171122)){
clean1 <- gsub("\\$","",test_20171122[[1]])
clean2 <- gsub("\\,","",clean1)
test_20171122[,i] <- clean2
i = i + 1 
}

I am trying to figure out the optimal way of cleaning a dataframe using gsub(). I feel like sapply() would work but it seems to break the structure of the dataframe when I run the following code:

test_20171122 <- sapply(test_20171122,function(x) gsub("\\$","",x))
test_20171122 <- sapply(test_20171122,function(x) gsub("\\,","",x))

Upvotes: 6

Views: 30015

Answers (2)

jayb
jayb

Reputation: 565

A solution using the purrr function map_df :

clean_df <- map_df(test_20171122, ~ gsub("[$,]", "", .x))

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193657

You can use the following pattern in gsub: "[$,]"

Example:

df <- data.frame(
  V1 = c("$1,234.56", " $ 23,456.70"),
  V2 = c("$89,101,124", "15,234")
)
df
#             V1          V2
# 1    $1,234.56 $89,101,124
# 2  $ 23,456.70      15,234

df[] <- lapply(df, function(x) as.numeric(gsub("[$,]", "", x)))
df
#         V1       V2
# 1  1234.56 89101124
# 2 23456.70    15234

Upvotes: 9

Related Questions