Reputation: 1468
Let's say I have a data frame (df) that contains the following data:
df = data.frame(name=c("David","Mark","Alice"),
                income=c("5,000","10,00","$50.55"),
                state=c("KS?","FL","CA;"))
I want to remove all punctuation from this data frame collectively. Of course, I could take each column as an individual vector and run a gsub command on it (see below), but I want to remove all punctuation in the whole data frame.
gsub("[?.;!¡¿·']", "", df$state)
Is there a way to specify this in R without writing a for loop or using an apply function to apply a function to each data frame column?
Upvotes: 1
Views: 4266
Reputation: 193517
Based on your criterion of "after importing", your condition of avoiding apply and family seems really arbitrary. I'd be interested in your logic for that.
Anyway, here's an alternative for fixing the problem after you have already imported the data that honors your peculiar condition:
1. Create your own class that can be used by colClasses in read.table and family.
2. Use do.call(paste, ...) to collapse your existing data.frame to a tab-separated character vector.
3. Read that vector back in, specifying colClasses this time.

Here is the above as an example:
setClass("spc") ## Strip punctuation and return a character vector
setAs("character", "spc", function(from)
  gsub("[[:punct:]]", "", from))
setClass("spn") ## Strip punctuation and return a numeric vector
setAs("character", "spn", function(from)
  as.numeric(gsub("[[:punct:]]", "", from)))
## Use those `class`es in `colClasses`
out2 <- read.delim(text = do.call(paste, c(df, sep = "\t")),
                   header = FALSE, colClasses = c("spc", "spn", "spc"))
str(out2)
# 'data.frame': 3 obs. of 3 variables:
# $ V1: chr "David" "Mark" "Alice"
# $ V2: num 5000 1000 5055
# $ V3: chr "KS" "FL" "CA"
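To see what colClasses does with these classes, it may help to call the coercion directly. The sketch below assumes the same "spn" definition as above and uses as() from the methods package, which is how read.table applies a non-standard colClasses entry as far as I know; the input strings are just illustrative.

```r
library(methods)

## Same definition as in the answer: strip punctuation, then convert to numeric
setClass("spn")
setAs("character", "spn", function(from)
  as.numeric(gsub("[[:punct:]]", "", from)))

## Calling the coercion by hand on a character vector
cleaned <- as(c("5,000", "$50.55"), "spn")
cleaned
```

Note that the decimal point counts as punctuation too, so "$50.55" becomes 5055 rather than 50.55.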
Alternatively, if any tabular form will suffice, you can convert the data to a matrix and use gsub on that.
gsub("[[:punct:]]", "", as.matrix(df))
# name income state
# [1,] "David" "5000" "KS"
# [2,] "Mark" "1000" "FL"
# [3,] "Alice" "5055" "CA"
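If you need a data frame back rather than a matrix, one possible follow-up is to wrap the result in as.data.frame and convert the numeric column by hand. This is only a sketch: the name clean is made up for illustration, and stringsAsFactors = FALSE is set explicitly on the assumption that you want character columns regardless of R version.

```r
df <- data.frame(name = c("David", "Mark", "Alice"),
                 income = c("5,000", "10,00", "$50.55"),
                 state = c("KS?", "FL", "CA;"),
                 stringsAsFactors = FALSE)

## gsub keeps the matrix dimensions and column names,
## so the cleaned matrix converts straight back to a data frame
clean <- as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)),
                       stringsAsFactors = FALSE)

## Every column is character at this point; convert income explicitly
clean$income <- as.numeric(clean$income)
str(clean)
```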
Upvotes: 1
Reputation: 59970
Like @joran said, you can use sed to substitute out the punctuation you want to get rid of, like this...
# Writing your data out to a file
write.table( df , "~/input.txt" , sep = "\t" )
# Reading it back in again, sans punctuation
read.table( pipe( "sed 's/[[:punct:]]//g' ~/input.txt" ) , header = TRUE )
# name income state
#1 David 5000 KS
#2 Mark 1000 FL
#3 Alice 5055 CA
sed processes your file line by line as it is being read into R. Using the [[:punct:]] regexp class will ensure you really do remove all punctuation.
And it can be done entirely within R. Lovely.
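As a rough illustration of why the [[:punct:]] class matters, the sketch below compares it with a hand-written character set. The set used here is a simplified ASCII subset of the one in the question, and the sample strings and variable names are made up for the comparison.

```r
x <- c("$5,000", "KS?", "CA;")

## Hand-written set: only removes the characters that are listed
custom <- gsub("[?.;!']", "", x)

## POSIX class: removes every punctuation character
posix <- gsub("[[:punct:]]", "", x)

custom
posix
```

The hand-written set leaves the dollar sign and comma in "$5,000" because neither is listed, while [[:punct:]] strips both.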
Upvotes: 6