ATMA

Reputation: 1468

Remove all punctuation from a csv after importing it

Let's say I have a data frame (df) that contains the following data:

df = data.frame(name=c("David","Mark","Alice"),
income=c("5,000","10,00","$50.55"),
state=c("KS?","FL","CA;"))

I want to remove all punctuation from this data frame collectively. Of course, I could take each column as an individual vector and run a gsub command on it (see below), but I want to remove all punctuation in the whole data frame.

gsub("[?.;!¡¿·']", "", df$state)

Is there a way to specify this in R without writing a for loop or using an apply function to apply a function to each data frame column?

Upvotes: 1

Views: 4266

Answers (2)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Based on your criteria of "after importing", your condition of avoiding apply and family seems really arbitrary. I'd be interested in your logic for that.

Anyway, here's an alternative for fixing the problem after you have already imported the data that honors your peculiar condition:

  • Create a new class that can be used by colClasses in read.table and family.
  • Use do.call(paste, ...) to collapse your existing data.frame to a tab-separated character vector.
  • Re-read that character vector, specifying colClasses this time.

Here is the above as an example:

setClass("spc")           ## Strip punctuation and return a character vector
setAs("character", "spc", function(from) 
  gsub("[[:punct:]]", "", from))
setClass("spn")           ## Strip punctuation and return a numeric vector
setAs("character", "spn", function(from) 
  as.numeric(gsub("[[:punct:]]", "", from)))

## Use those `class`es in `colClasses`
out2 <- read.delim(text = do.call(paste, c(df, sep = "\t")), 
                   header = FALSE, colClasses = c("spc", "spn", "spc"))
str(out2)
# 'data.frame':  3 obs. of  3 variables:
#  $ V1: chr  "David" "Mark" "Alice"
#  $ V2: num  5000 1000 5055
#  $ V3: chr  "KS" "FL" "CA"
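If the do.call(paste, ...) step looks opaque, here is a minimal sketch of what it produces on its own, assuming the same df as in the question (stringsAsFactors = FALSE is my addition so the columns stay character on older R versions). Each row is collapsed into one tab-separated string, which is what read.delim(text = ...) then parses as a line of input:

```r
# Same data as in the question; stringsAsFactors = FALSE is an assumption
# added here so the columns are plain character vectors on any R version.
df <- data.frame(name = c("David", "Mark", "Alice"),
                 income = c("5,000", "10,00", "$50.55"),
                 state = c("KS?", "FL", "CA;"),
                 stringsAsFactors = FALSE)

# c(df, sep = "\t") builds the argument list for paste():
# one argument per column, plus the separator.
collapsed <- do.call(paste, c(df, sep = "\t"))

# One string per row, fields separated by tabs.
writeLines(collapsed)
```

The punctuation is still present at this point; it is only stripped when the custom colClasses are applied during the re-read.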

Alternatively, if any tabular form will suffice, you can convert the data to a matrix and use gsub on that.

gsub("[[:punct:]]", "", as.matrix(df))
#      name    income state
# [1,] "David" "5000" "KS" 
# [2,] "Mark"  "1000" "FL" 
# [3,] "Alice" "5055" "CA" 
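If you need a data.frame back rather than a matrix, something along these lines should work. This is only a sketch under the assumption that the same df is in use; the name clean is mine, and the as.numeric conversion of income is an extra step you may or may not want:

```r
# Same data as in the question; stringsAsFactors = FALSE is an assumption
# added here so the columns are plain character vectors on any R version.
df <- data.frame(name = c("David", "Mark", "Alice"),
                 income = c("5,000", "10,00", "$50.55"),
                 state = c("KS?", "FL", "CA;"),
                 stringsAsFactors = FALSE)

# gsub() keeps the matrix dimensions and column names,
# so the result converts straight back to a data.frame.
clean <- as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)),
                       stringsAsFactors = FALSE)

# Every column is character after the round trip; convert as needed.
clean$income <- as.numeric(clean$income)
str(clean)
```

Note that the decimal point counts as punctuation too, so "$50.55" becomes 5055, just as in the colClasses version.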

Upvotes: 1

Simon O'Hanlon

Reputation: 59970

Like @joran said, you can use sed to strip out the punctuation you want to get rid of, like this...

#  Writing your data out to a file
write.table( df , "~/input.txt" , sep = "\t" )

#  Reading it back in again, sans punctuation
read.table( pipe( "sed 's/[[:punct:]]//g' ~/input.txt" ) , header = TRUE )
#   name income state
#1 David   5000    KS
#2  Mark   1000    FL
#3 Alice   5055    CA

sed processes your file line by line as it is being read into R. Using the [[:punct:]] regexp class will ensure you really do remove all punctuation.

And it can be done entirely within R. Lovely.
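If sed isn't available on your system, the same line-by-line idea can be sketched in plain R. This is not the sed approach itself, just an analogue of it: readLines and gsub stand in for sed, and the temporary file path and the names raw_lines, clean_lines and out are made up for illustration:

```r
# Same data as in the question; stringsAsFactors = FALSE is an assumption
# added here so the columns are plain character vectors on any R version.
df <- data.frame(name = c("David", "Mark", "Alice"),
                 income = c("5,000", "10,00", "$50.55"),
                 state = c("KS?", "FL", "CA;"),
                 stringsAsFactors = FALSE)

# Write the data to a temporary tab-separated file (hypothetical path).
tmp <- tempfile(fileext = ".txt")
write.table(df, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Strip punctuation from each raw line before parsing.
# Tabs are not in [[:punct:]], so the field separators survive.
raw_lines <- readLines(tmp)
clean_lines <- gsub("[[:punct:]]", "", raw_lines)

# Parse the cleaned lines as if they had come from a file.
out <- read.table(text = clean_lines, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)
str(out)
```

Unlike the pipe version, this reads the whole file into memory first, so for very large files sed is likely the better choice.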

Upvotes: 6
