Reputation: 2874
I have many .txt files containing unwanted â characters dotted about everywhere after having used regex to remove URLs and whitespace. I need to remove all of these from all the files.
These â characters were not present before cleaning the files; they are being produced as a result of the cleaning.
I found a regex that works for my text, and the URLs are being removed. First of all, here is my cleaning process (the commented-out lines are other things I have tried):
clean_file <- sapply(curr_file, function(x) {
  gsub("&amp;", "&", x) %>%                     # unescape HTML ampersands
    gsub("http\\S+\\s*", "", .) %>%             # remove URLs
    gsub("[^[:alpha:][:space:]&']", "", .) %>%  # keep only letters, spaces, & and '
    #gsub("[^[:alnum:][:space:]\\'-]", "", .) %>%
    stripWhitespace() %>%                       # tm::stripWhitespace(): collapse runs of spaces
    gsub("^ ", "", .) %>%                       # trim leading space
    gsub(" $", "", .)                           # trim trailing space
    #gsub("â", "", .)
})
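To make the behaviour concrete, here is a base-R walk-through of the same steps on one of the example lines, with gsub("\\s+", " ", .) standing in for tm::stripWhitespace() so the snippet has no package dependencies:

```r
x <- "Calif. GHG cap-and-trade: a bull or a bear market? http://bit.ly/VG9DTr"

step1 <- gsub("http\\S+\\s*", "", x)                 # remove the URL
step2 <- gsub("[^[:alpha:][:space:]&']", "", step1)  # keep letters, spaces, & and '
step3 <- gsub("\\s+", " ", step2)                    # stand-in for tm::stripWhitespace()
cleaned <- gsub("^ | $", "", step3)                  # trim leading/trailing spaces

cleaned
# [1] "Calif GHG capandtrade a bull or a bear market"
```

This matches line [6,] of the console output shown below; note that the punctuation class also swallows the hyphens in "cap-and-trade".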
Example input text (each line is a character string):
Gluskin’s Rosenberg: Don’t Bet on a Bear Market for Treasurys - Rising Treasury yields?... http://j.mp/UVM31t #FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market: Large investment asset losses can be… http://goo.gl/fb/cgzGv
Thank You http://pages.townhall.com/campaign/will-2013-be-a-bull-or-bear-market … via @townhallcom
Calif. GHG cap-and-trade: a bull or a bear market? http://bit.ly/VG9DTr
Unfortunately it doesn't appear here, but there are also some non-standard characters in the text above, namely \302. R sees them just as that:
> x = _ <-- appears as an underscore in my text editor
Error: object '\302' not found
They may come from shift+space, as hinted here; however, they are an artefact of my data, so I need to remove them - I cannot prevent them.
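As a diagnostic (my reading, not from the original post): the curly apostrophe U+2019 in the raw tweets is three bytes in UTF-8, and its lead byte 0xE2 is "â" when interpreted as Latin-1; the alpha-only regex then strips the two continuation bytes, leaving the bare â. Likewise \302 is octal for 0xC2, the lead byte of a UTF-8 non-breaking space (shift+space on some keyboards). charToRaw shows the bytes:

```r
# U+2019 (curly apostrophe): 0xe2 displays as "â" when read as Latin-1
bytes_apos <- charToRaw("\u2019")
bytes_apos
# [1] e2 80 99

# U+00A0 (non-breaking space): lead byte 0xc2 = octal \302
bytes_nbsp <- charToRaw("\ua0")
# [1] c2 a0
```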
Output produced (visible in the saved .txt file):
Gluskinâs Rosenberg Donât Bet on a Bear Market for Treasurys - Rising Treasury yields FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can beâ
Thank You â via townhallcom
Calif GHG cap-and-trade a bull or a bear market
Output as visible in R console:
> head(clean_file)
..text
[1,] "Nice bear market rally for the Lakers NBA"
[2,] "Commented on StockTwits your scenario is entirely possible and as long as SPX doesn't exceed the bear market"
[3,] "Gluskin\342s Rosenberg Don\342t Bet on a Bear Market for Treasurys Rising Treasury yields FederalReserve"
[4,] "Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can be\342"
[5,] "Thank You \342 via townhallcom"
[6,] "Calif GHG capandtrade a bull or a bear market"
Before I realised this was an encoding issue, I tried simply replacing the â characters, which failed:
gsub("â", "", myText)
I have tried a few other approaches to changing the encoding of the file (found in the solutions here).
I tried to write to file forcing the output encoding with fileEncoding = 'ascii' instead of the default UTF-8 (I believe), but ASCII simply gave me warnings and truncated many lines, leaving some completely empty. There also didn't seem to be any correlation between the truncated lines and where the â character had previously appeared.
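One alternative to forcing the encoding at write time (my suggestion, not from the original post) is to convert the strings themselves before writing: iconv() with sub = "" silently drops any character that has no representation in the target encoding.

```r
x <- "Gluskin\u00e2s Rosenberg Don\u00e2t Bet on a Bear Market"

ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
ascii_only
# [1] "Gluskins Rosenberg Dont Bet on a Bear Market"
```

Unlike fileEncoding = 'ascii', this never truncates a line; it only deletes the individual offending characters.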
Is there a way to prevent these characters from being created when writing in the future?
Upvotes: 0
Views: 1204
Reputation: 269481
This keeps only characters from hex 00 to hex 7f (i.e. plain ASCII), where Lines is a character vector whose components are the lines of your file:
gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)
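For example, applied to lines containing the stray â (written here as the escape \u00e2):

```r
Lines <- c("Gluskin\u00e2s Rosenberg", "Thank You \u00e2 via townhallcom")

result <- gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)
result
# [1] "Gluskins Rosenberg"          "Thank You  via townhallcom"
```

Note that removing a character between two spaces leaves a double space, so you may want to re-run your whitespace-collapsing step afterwards.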
Upvotes: 5