Reputation: 13
I want to remove punctuation, numbers and http links from the text in a data.frame. I tried the tm, stringr, quanteda and tidytext packages, but none of them worked. I'm looking for a simple package or function to clean a data.frame without converting it to a corpus or something like that.
mycorpus <- tm_map(mycorpus, content_transformer(remove_url))
Warning message:
In tm_map.SimpleCorpus(mycorpus, content_transformer(remove_url)) : transformation drops documents
mycorpus <- tm_map(mycorpus, removePunctuation)
Warning message:
In tm_map.SimpleCorpus(mycorpus, removePunctuation) : transformation drops documents
And when I try to view some tweets that contain any symbol, I get:
Error in nchar(output) : invalid multibyte string, element 1
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input
Upvotes: 0
Views: 3580
Reputation: 2206
If you only want to keep letters, a concise approach is to replace everything that is not a letter. I assume you want to replace it with a space, since you mentioned a corpus; otherwise your addresses will be collapsed into one long string (but maybe that is what you want; as noted, it would help if you provided an example).
x = c("https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r"
, "http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r")
gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)
# [1] " stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r"
# [2] " stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r"
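Note that each removed character is replaced by its own space, so the result can contain runs of spaces where the URL scheme or a number used to be. A minimal sketch of tidying that up, assuming single spaces between words are what you want (the shortened URL and the `cleaned` and `tidy` names are only illustrative):

```r
x <- "https://stackoverflow.com/questions/51582369/how-can-i"

# Same replacement as above: every non-word character, digit or "http(s)"
# is turned into a single space
cleaned <- gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)

# Collapse runs of whitespace to one space, then strip leading/trailing whitespace
tidy <- trimws(gsub("\\s+", " ", cleaned))
tidy
# [1] "stackoverflow com questions how can i"
```

`trimws()` is in base R (3.2.0 and later), so no extra package is needed for this step.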
The same task for a data frame of 100,000 rows:
# make sure that your strings are not factors
df = data.frame(id = 1:1e5, url = rep(x, 1e5/2), stringsAsFactors = FALSE)
# df before replacement
df[1:4, ]
# id url
# 1 1 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 2 2 http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 3 3 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 4 4 http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# apply replacement on a specific column and assign result back to this column
df$url = gsub("\\W|\\d|http\\w?", " ", df$url, perl = TRUE)
# check output
df[1:4, ]
# id url
# 1 1 stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r
# 2 2 stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r
# 3 3 stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r
# 4 4 stackoverflow com questions how can i remove punctuations and numbers in text from data frame file in r
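The "invalid multibyte string" and "invalid input" errors in the question suggest that some tweets contain bytes that are not valid in the current encoding, and such strings may trip up the regex step as well. A hedged sketch of sanitizing a column first, assuming the text is meant to be UTF-8 (the sample data and the `text` column are hypothetical, and the exact behaviour of `iconv()` can vary by platform):

```r
# Hypothetical data: the first string contains a stray Latin-1 byte (\xe9)
# that is not valid UTF-8
df <- data.frame(id = 1:2,
                 text = c("caf\xe9 time!", "plain text 123"),
                 stringsAsFactors = FALSE)

# Drop bytes that cannot be interpreted as UTF-8 (sub = "" removes them)
df$text <- iconv(df$text, from = "UTF-8", to = "UTF-8", sub = "")

# Confirm every string is valid UTF-8 before running the regex
stopifnot(all(validUTF8(df$text)))

# Same replacement as above, applied to the sanitized column
df$text <- gsub("\\W|\\d|http\\w?", " ", df$text, perl = TRUE)
```

Removing the offending bytes loses information; passing a visible marker such as `sub = "?"` instead may be preferable if you need to see where the bad bytes were.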
Upvotes: 0
Reputation: 133518
Since you haven't posted any sample input or expected output, I couldn't test this. To remove punctuation, digits and http links from a specific column of your data frame, you could try the following:
gsub("[[:punct:]]|[[:digit:]]|^http:\\/\\/.*|^https:\\/\\/.*","",df$column)
Or, as per Rui's suggestion in the comments, you could also use the following:
gsub("[[:punct:]]|[[:digit:]]|(http[[:alpha:]]*:\\/\\/)","",df$column)
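The two patterns behave differently, which a small comparison may make clearer. This is only a sketch on made-up input (the `txt` vector and the `example.com` URLs are assumptions, not data from the question):

```r
# Hypothetical sample: a bare URL, and a sentence with a URL in the middle
txt <- c("https://example.com/page1", "Read this: http://example.com/a1, ok?")

# First pattern: the URL alternatives are anchored with ^, so a string that
# starts with a URL is emptied, while a URL in the middle of a string only
# loses its punctuation and digits
p1 <- "[[:punct:]]|[[:digit:]]|^http:\\/\\/.*|^https:\\/\\/.*"
out1 <- gsub(p1, "", txt)
out1
# [1] ""                             "Read this httpexamplecoma ok"

# Second pattern: removes the "http://" or "https://" scheme wherever it
# occurs, plus punctuation and digits; the remaining letters of the URL stay
p2 <- "[[:punct:]]|[[:digit:]]|(http[[:alpha:]]*:\\/\\/)"
out2 <- gsub(p2, "", txt)
out2
# [1] "examplecompage"           "Read this examplecoma ok"
```

Neither pattern removes a whole link that appears in the middle of a tweet, so if that is the goal the URL probably needs to be stripped in a separate step before the punctuation and digits.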
Upvotes: 4