n1k31t4

Reputation: 2874

Removing strings with regex producing special character: â

Short version:

I have many .txt files with unwanted characters, such as â, dotted about everywhere after using regex to remove URLs and whitespace. I need to remove all of these from all the files.

These â characters were not present before cleaning the files; they are produced by the cleaning itself.

Long version:

I found a regex that works for my text, and the URLs are being removed. First of all, my cleaning process (the commented out lines are other things I have tried):

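library(magrittr)  # provides %>% (dplyr also exports it)
library(tm)        # provides stripWhitespace()

clean_file <- sapply(curr_file, function(x) {
    gsub("&amp;", "&", x) %>%                        # decode HTML-escaped ampersands
        gsub("http\\S+\\s*", "", .) %>%              # remove URLs
        gsub("[^[:alpha:][:space:]&']", "", .) %>%   # keep letters, whitespace, & and '
        #gsub("[^[:alnum:][:space:]\\'-]", "", .) %>%
        stripWhitespace() %>%                        # collapse repeated whitespace
        gsub("^ ", "", .) %>%                        # trim a leading space
        gsub(" $", "", .)                            # trim a trailing space
        #gsub("â", "", .)
})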

Example input text (each line is a character string):

Gluskin’s Rosenberg: Don’t Bet on a Bear Market for Treasurys -  Rising Treasury yields?... http://j.mp/UVM31t   #FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market: Large investment asset losses can be… http://goo.gl/fb/cgzGv 
Thank You http://pages.townhall.com/campaign/will-2013-be-a-bull-or-bear-market …  via @townhallcom
Calif. GHG cap-and-trade: a bull or a bear market? http://bit.ly/VG9DTr 

Unfortunately they do not render here, but the text above also contains some non-standard characters, which R reports as \302:

> x = _                                   <-- appears as an underscore in my text editor
Error: object '\302' not found

It may be that they come from shift+space, as hinted here. However, they are an artefact of my data, so I cannot prevent them and need to remove them.

Output produced (visible in saved .txt file):

Gluskinâs Rosenberg Donât Bet on a Bear Market for Treasurys - Rising Treasury yields FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can beâ
Thank You â via townhallcom
Calif GHG cap-and-trade a bull or a bear market

Output as visible in R console:

> head(clean_file)
      ..text                                                                                                        
[1,] "Nice bear market rally for the Lakers NBA"                                                                    
[2,] "Commented on StockTwits your scenario is entirely possible and as long as SPX doesn't exceed the bear market" 
[3,] "Gluskin\342s Rosenberg Don\342t Bet on a Bear Market for Treasurys Rising Treasury yields FederalReserve"           
[4,] "Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can be\342"
[5,] "Thank You \342 via townhallcom"
[6,] "Calif GHG capandtrade a bull or a bear market"

Before I thought of this as an encoding issue, I tried simply replacing the â characters, which failed:

gsub("â", "", myText)

I have tried a few other solutions for changing the encoding of the file (found in the solutions here). I tried writing to file while forcing the output encoding with fileEncoding = 'ascii' instead of the default (UTF-8, I believe), but ascii simply gave me warnings and truncated many lines, leaving some completely empty. There also didn't seem to be any correlation between the lines that were truncated and where the â character had previously appeared.

Can I try to prevent these characters from being created when writing in the future?

Upvotes: 0

Views: 1204

Answers (1)

G. Grothendieck

Reputation: 269481

This keeps only characters in the range hex 00 to hex 7f (that is, plain ASCII), where Lines is a character vector whose components are the lines of your file:

gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)
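As a rough illustration of the line above, here is a minimal sketch on a few made-up lines modelled on the question's sample text. The sample vector and the names ascii_only and normalised are assumptions for the example, not part of the original data. Since stripping every non-ASCII character also drops typographic apostrophes (so "Don’t" becomes "Dont"), the sketch also shows one possible way to map the curly apostrophe to a plain one first.

```r
# Hypothetical sample lines: "\u2019" is a curly apostrophe, "\u2026" an ellipsis.
Lines <- c("Gluskin\u2019s Rosenberg: Don\u2019t Bet on a Bear Market",
           "asset losses can be\u2026",
           "Calif. GHG cap-and-trade")

# Remove every character outside the ASCII range (hex 00 to hex 7f).
ascii_only <- gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)

# Optional variant: turn curly apostrophes into plain ones before stripping,
# so contractions keep their apostrophe.
normalised <- gsub("\u2019", "'", Lines, fixed = TRUE)
normalised <- gsub("[^\\x{00}-\\x{7f}]", "", normalised, perl = TRUE)

print(ascii_only)
print(normalised)
```

Running the ASCII filter before the other cleaning steps, rather than after, should stop any regex from ever seeing the multi-byte characters. That ordering is a suggestion and has not been tested on the original files.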

Upvotes: 5
