andrew_will

Reputation: 15

Delete special characters from a text file after the first element in R

I am trying to read a text file using read.table() in R. By default, read.table() treats # as a comment character and ignores everything that follows it on a line. However, my text contains pound symbols that have nothing to do with comments. I want to delete those unwanted # symbols without adding the comments to the data frame.

Fortunately, all of the pound symbols that I want to keep are in the first element of each row. So basically I need to delete all # symbols that are not in the first element of the row.

2018-08-14 00:00:42 102.18.18.2  
2018-08-15 00:00:47 223.45.67.8    
2018-08-15 00:00:48 026.15.65.0    
2018-08-15 00:00:49 924.43.47.0    
2018-08-15 00:00:49 122.45.#67.9

I want to keep the pound symbol in the first line and delete the pound symbol in the last line that is causing problems in the data frame.

Upvotes: 1

Views: 825

Answers (2)

Tyler Rinker

Reputation: 109874

Here's a possible pure R solution:

MWE

First let's make your problem a full MWE (https://stackoverflow.com/help/mcve):

cat(
'#2018-08-14 00:00:42 102.18.18.2',
'2018-08-14 00:00:42 102.18.18.2',  
'2018-08-15 00:00:47 223.45.67.8',    
'2018-08-15 00:00:48 026.15.65.0',    
'2018-08-15 00:00:49 924.43.47.0',    
'2018-08-15 00:00:49 122.45.#67.9', sep = '\n', file = 'mytable.txt')

This creates a file in your working directory that we can read in.
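To see the problem before fixing it, here is a small sketch that reproduces it. It uses the text = argument with a shortened copy of the lines above instead of the file, so it runs on its own; the variable names raw and broken are only illustrative.

```r
# A shortened copy of the MWE lines, held in memory instead of 'mytable.txt'
raw <- c(
  '#2018-08-14 00:00:42 102.18.18.2',
  '2018-08-14 00:00:42 102.18.18.2',
  '2018-08-15 00:00:49 122.45.#67.9'
)

# Default comment.char = '#': everything from '#' to the end of a line is dropped,
# so the first line disappears and the last value is cut short
broken <- read.table(text = raw, stringsAsFactors = FALSE)
broken$V3   # the last value comes back as "122.45."
```

The first line is skipped entirely, which is what we want for a comment, but the last row loses the end of its third field.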

Solution

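# Read the raw lines so the '#' characters can be edited before parsing
(x <- readLines('mytable.txt'))
# Drop every '#' that is not the first character of its line
(y <- gsub('(?<!^)#', '', x, perl = TRUE))
# Parse the cleaned lines; lines still starting with '#' are skipped as comments
read.table(text = y)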

##           V1       V2          V3
## 1 2018-08-14 00:00:42 102.18.18.2
## 2 2018-08-15 00:00:47 223.45.67.8
## 3 2018-08-15 00:00:48 026.15.65.0
## 4 2018-08-15 00:00:49 924.43.47.0
## 5 2018-08-15 00:00:49 122.45.67.9

I wrapped each line in () so you can see the output. In a real application I wouldn't include them.

The magic happens in the gsub('(?<!^)#', '', x, perl = TRUE) line. It uses a negative lookbehind (https://www.regular-expressions.info/lookaround.html), (?<!^)#, which can be read as:

  • # (any pound sign) BUT
  • ? (what) < (comes before)
  • ! (is not)
  • ^ (the beginning of the line)

In other words, it matches every # that is not the first character of a line.
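A quick way to check how the lookbehind behaves is to try it on a few made-up strings (the vector below is illustrative, not from the question's data):

```r
# Illustrative inputs: a comment line, an inner '#', several '#', and no '#'
x <- c('#comment line', 'a#b', '##x#', 'no hash')

# perl = TRUE is required because the default regex engine lacks lookbehind
cleaned <- gsub('(?<!^)#', '', x, perl = TRUE)
cleaned
```

Only a # in the very first position survives, so '##x#' keeps its leading # and loses the other two.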

Upvotes: 0

Achilles

Reputation: 36

You can do it using a regular expression feature known as capture groups.

Just open your file in an editor which supports finding text using RegEx, such as VS Code.

In the Find box, write: (.+)(#)

In the replace box, write: $1

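Clicking Replace All removes the last # on each line that has at least one character before it, so a # at the very start of a line is kept. Because .+ is greedy, only one # per line is removed per pass; if a line contains several, repeat Replace All until nothing matches.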

Alternatively, you could also write a script to do this.
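As a rough sketch of such a script in R: this is not a translation of the editor regex above. It swaps in a plain split on the first character, and strip_inner_hash is a hypothetical helper name.

```r
# Hypothetical helper: keep the first character of each line untouched and
# remove every '#' from the remainder of the line.
strip_inner_hash <- function(lines) {
  first <- substr(lines, 1, 1)   # "" for empty lines
  rest  <- substring(lines, 2)   # everything after the first character
  paste0(first, gsub('#', '', rest, fixed = TRUE))
}

x <- c('#2018-08-14 00:00:42 102.18.18.2',
       '2018-08-15 00:00:49 122.45.#67.9')
strip_inner_hash(x)
```

The cleaned lines can then be passed to read.table(text = ...) or written back to disk with writeLines().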

Upvotes: 1
