vengefulsealion

Reputation: 766

How can I strip dollar signs ($) from a data frame in R?

I'm quite new to R and am battling a bit with what would appear to be an extremely simple query.

I've imported a csv file into R using read.csv and am trying to remove the dollar signs ($) prior to tidying the data and further analysis (the dollar signs are playing havoc with charting).

I've been trying without luck to strip the $ using dplyr and gsub from the data frame and I'd really appreciate some advice about how to go about it.

My data frame looks like this:

> str(data)
 'data.frame':  50 obs. of  17 variables:
 $ Year            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Prog.Cost       : Factor w/ 2 levels "-$3,333","$0": 1 2 2 2 2 2 2 2 2 2 ...
 $ Total.Benefits  : Factor w/ 44 levels "$2,155","$2,418",..: 25 5 7 11 12 10 9 14 13 8 ...
 $ Net.Cash.Flow   : Factor w/ 45 levels "-$2,825","$2,155",..: 1 6 8 12 13 11 10 15 14 9 ...
 $ Participant     : Factor w/ 46 levels "$0","$109","$123",..: 1 1 1 45 46 2 3 4 5 6 ...
 $ Taxpayer        : Factor w/ 48 levels "$113","$114",..: 19 32 35 37 38 40 41 45 48 47 ...
 $ Others          : Factor w/ 47 levels "-$9","$1,026",..: 12 25 26 24 23 11 9 10 8 7 ...
 $ Indirect        : Factor w/ 42 levels "-$1,626","-$2",..: 1 6 10 18 22 24 28 33 36 35 ...
 $ Crime           : Factor w/ 35 levels "$0","$1","$10",..: 6 11 13 19 21 23 28 31 33 32 ...
 $ Child.Welfare   : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
 $ Education       : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
 $ Health.Care     : Factor w/ 38 levels "-$10","-$11",..: 7 7 7 7 2 8 12 36 30 9 ...
 $ Welfare         : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
 $ Earnings        : Factor w/ 41 levels "$0","$101","$104",..: 1 1 1 22 23 24 25 26 27 28 ...
 $ State.Benefits  : Factor w/ 37 levels "$102","$117",..: 37 1 3 4 6 10 12 18 24 27 ...
 $ Local.Benefits  : Factor w/ 24 levels "$115","$136",..: 24 1 2 12 14 16 19 22 23 21 ...
 $ Federal.Benefits: Factor w/ 39 levels "$0","$100","$102",..: 1 1 1 12 12 17 20 19 19 21 ...

Upvotes: 2

Views: 7297

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

If you have to read a lot of csv files with data like this, consider defining your own "as" method (via setAs) to use with the colClasses argument, like this:

setClass("dollar")
setAs("character", "dollar",
      function(from) 
        as.numeric(gsub("[,$]", "", from, fixed = FALSE)))

Before demonstrating how to use it, let's write @akrun's sample data to a temporary csv file whose path is stored in A. This step is not necessary in your actual use case, where you would read your file directly.

## write @akrun's sample data to a temporary csv file (path stored in A)
set.seed(24)
data <- data.frame(
  Year=1:6, 
  Prog.Cost= sample(c("-$3,3333", "$0"), 6, replace = TRUE), 
  Total.Benefits = sample(c("$2,155","$2,418","$2,312"), 6, replace=TRUE))

A <- tempfile()
write.csv(data, A, row.names = FALSE)

Now, you have a new option for colClasses that can be used with read.csv :-)

read.csv(A, colClasses = c("numeric", "dollar", "dollar"))
#   Year Prog.Cost Total.Benefits
# 1    1    -33333           2155
# 2    2    -33333           2312
# 3    3         0           2312
# 4    4         0           2155
# 5    5         0           2418
# 6    6         0           2418
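
To try the same idea without touching the file system, here is a minimal self-contained sketch that reads from an in-memory string through the text argument of read.csv. The csv_lines values and the parsed name are made up for illustration and are not the asker's data; the "dollar" class and coercion are the ones defined above, repeated so the block runs on its own.

```r
library(methods)  # setClass() and setAs() live in the methods package

# Register a placeholder class and a character -> "dollar" coercion
# that strips "$" and "," before converting to numeric.
setClass("dollar")
setAs("character", "dollar",
      function(from) as.numeric(gsub("[,$]", "", from)))

# Illustrative csv content (assumed values, not from the question).
csv_lines <- c('Year,Prog.Cost,Total.Benefits',
               '1,"-$3,333","$2,155"',
               '2,"$0","$2,418"')

# read.csv() reads the "dollar" columns as character and then
# applies the coercion registered above.
parsed <- read.csv(text = csv_lines,
                   colClasses = c("integer", "dollar", "dollar"))
str(parsed)
```

The coercion is applied per column, so any column that is not listed as "dollar" is left to its declared class.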

Upvotes: 3

Rich Scriven

Reputation: 99391

It would probably be more beneficial to just read it again, this time with readLines. I wrote akrun's data to the file "data.txt" and fixed the strings before reading the table. I wasn't sure whether the comma was a decimal point or an annoying thousands separator, so I chose decimal point.

r <- gsub("[$]", "", readLines("data.txt"))
read.table(text = r, dec = ",")
#   Year Prog.Cost Total.Benefits
# 1    1   -3.3333          2.155
# 2    2   -3.3333          2.312
# 3    3    0.0000          2.312
# 4    4    0.0000          2.155
# 5    5    0.0000          2.418
# 6    6    0.0000          2.418
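
If the comma turns out to be a thousands separator instead, the same readLines approach can strip it together with the $. The following is only a sketch under that assumption: the example values, the temporary file, and the variable names are made up for illustration, and it relies on the file being whitespace-delimited (as written by write.table), so removing commas does not disturb the field separators. It would not be safe on a comma-separated file.

```r
# Illustrative data (assumed values), written as a whitespace-delimited file.
example <- data.frame(
  Year = 1:3,
  Prog.Cost = c("-$3,333", "$0", "$0"),
  Total.Benefits = c("$2,155", "$2,418", "$2,312")
)
path <- tempfile(fileext = ".txt")
write.table(example, path)

# Read the raw lines, drop every "$" and ",", then parse the cleaned text.
raw_lines <- readLines(path)
cleaned_lines <- gsub("[$,]", "", raw_lines)
result <- read.table(text = cleaned_lines)
str(result)
```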

Upvotes: 2

akrun

Reputation: 887961

If you only need to remove the $ and do not want to change the class of the columns:

indx <- sapply(data, is.factor) 
data[indx] <- lapply(data[indx], function(x) 
                            as.factor(gsub("\\$", "", x)))

If you need numeric columns, you can strip out the commas as well (contributed by @David Arenburg) and convert the result with as.numeric:

data[indx] <- lapply(data[indx], function(x) as.numeric(gsub("[,$]", "", x)))

You can wrap this in a function:

f1 <- function(dat, pat = "[$]", Class = "factor") {
  indx <- sapply(dat, is.factor)
  if (Class == "factor") {
    dat[indx] <- lapply(dat[indx], function(x) as.factor(gsub(pat, "", x)))
  } else {
    dat[indx] <- lapply(dat[indx], function(x) as.numeric(gsub(pat, "", x)))
  }
  dat
}

f1(data)
f1(data, pat = "[,$]", "numeric")
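
One caveat: since R 4.0.0, read.csv and data.frame no longer turn strings into factors by default, so the is.factor test above may select no columns at all. A minimal sketch of a variant that handles both factor and character columns follows; strip_dollars and the example values are hypothetical names and data chosen for illustration, not part of the original answer.

```r
# Hypothetical helper: convert every text column (factor or character)
# to numeric after removing "$" and ",".
strip_dollars <- function(dat) {
  is_text <- vapply(dat,
                    function(x) is.factor(x) || is.character(x),
                    logical(1))
  dat[is_text] <- lapply(dat[is_text], function(x) {
    # as.character() first, so factors are converted via their labels
    as.numeric(gsub("[,$]", "", as.character(x)))
  })
  dat
}

# Illustrative data (assumed values).
example <- data.frame(
  Year = 1:3,
  Prog.Cost = c("-$3,333", "$0", "$0"),
  Total.Benefits = c("$2,155", "$2,418", "$2,312")
)
cleaned <- strip_dollars(example)
str(cleaned)
```

Note that this converts every text column, so a genuinely non-numeric column would become NA (with a warning); restrict is_text to the money columns if the data frame has any such columns.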

Sample data:

set.seed(24)
data <- data.frame(Year = 1:6,
                   Prog.Cost = sample(c("-$3,3333", "$0"), 6, replace = TRUE),
                   Total.Benefits = sample(c("$2,155", "$2,418", "$2,312"),
                                           6, replace = TRUE),
                   stringsAsFactors = TRUE)  # needed in R >= 4.0.0, where the default is FALSE

Upvotes: 6
