Reputation: 15
I am trying to read a text file using read.table() in R. R does not read anything that follows a #. However, there are pound symbols in the text that have nothing to do with the comments. I want to delete the unwanted # symbols without adding the comments to the data frame.
Fortunately, all of the pound symbols that I want to keep are in the first element of each row. So basically I need to delete all # symbols that are not in the first element of the row.
2018-08-14 00:00:42 102.18.18.2
2018-08-15 00:00:47 223.45.67.8
2018-08-15 00:00:48 026.15.65.0
2018-08-15 00:00:49 924.43.47.0
2018-08-15 00:00:49 122.45.#67.9
I want to keep the pound symbol in the first line and delete the pound symbol in the last line that is causing problems in the data frame.
Upvotes: 1
Views: 825
Reputation: 109874
Here's a possible pure R solution:
First let's make your problem a full MWE (https://stackoverflow.com/help/mcve):
cat(
'#2018-08-14 00:00:42 102.18.18.2',
'2018-08-14 00:00:42 102.18.18.2',
'2018-08-15 00:00:47 223.45.67.8',
'2018-08-15 00:00:48 026.15.65.0',
'2018-08-15 00:00:49 924.43.47.0',
'2018-08-15 00:00:49 122.45.#67.9', sep = '\n', file = 'mytable.txt')
This creates a file in your working directory that we can read in.
(x <- readLines('mytable.txt'))
(y <- gsub('(?<!^)#', '', x, perl = TRUE))
read.table(text = y)
## V1 V2 V3
## 1 2018-08-14 00:00:42 102.18.18.2
## 2 2018-08-15 00:00:47 223.45.67.8
## 3 2018-08-15 00:00:48 026.15.65.0
## 4 2018-08-15 00:00:49 924.43.47.0
## 5 2018-08-15 00:00:49 122.45.67.9
I wrapped each line in () so you can see the output. In a real application I wouldn't include them.
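To check the substitution in isolation, here is a small sketch that applies the same pattern to a few strings. The sample strings are invented for illustration and are not from the original file.

```r
# Made-up sample lines covering the main cases (hypothetical data).
samples <- c(
  "#a comment line",          # leading # only: kept
  "2018-08-15 122.45.#67.9",  # stray # inside a field: removed
  "#comment with # inside",   # leading # kept, inner # removed
  "a#b#c"                     # several stray # characters: all removed
)

# Same pattern as in the answer: drop every # that is not at the
# very start of a string.
cleaned <- gsub("(?<!^)#", "", samples, perl = TRUE)
print(cleaned)

# read.table() still treats a leading # as a comment marker, so of the
# first two lines only the non-comment one reaches the data frame.
df <- read.table(text = cleaned[1:2])
print(df)
```

Because gsub() is vectorised, the pattern is applied to each line separately, so ^ refers to the start of each individual line.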
The magic happens in the gsub('(?<!^)#', '', x, perl = TRUE) line. It uses a negative lookbehind (https://www.regular-expressions.info/lookaround.html), (?<!^)#, which can be read as:
# (any pound sign) BUT ? (what) < (comes before) ! (is not) ^ (the beginning of the line)
Upvotes: 0
Reputation: 36
You can do it using a regular expression feature known as capture groups.
Just open your file in an editor that supports finding text with regex, such as VS Code.
In the Find box (with regex mode enabled), write: (.+)(#)
In the Replace box, write: $1
Clicking Replace All should remove the # characters that appear in the middle of a line. Because .+ is greedy, each pass removes only the last such # on a line, so repeat Replace All if a line contains more than one.
Alternatively, you could write a script to do this.
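One way such a script might look in R is sketched below. It reuses the (.+)(#) idea, keeping the captured text and dropping the # that follows it, and repeats the substitution until no # is left after the first character of any line. The function name strip_inner_hashes and the sample lines are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: remove every # that is preceded by at least one
# character, leaving a # at the very start of a line untouched.
strip_inner_hashes <- function(lines) {
  # ".#" matches only when some character comes before a #.
  while (any(grepl(".#", lines))) {
    # (.+) is greedy, so each pass drops the last such # on each line;
    # "\\1" puts the captured text back without the #.
    lines <- sub("(.+)#", "\\1", lines)
  }
  lines
}

# Made-up input, not taken from the original file.
input <- c(
  "#2018-08-14 a comment line",
  "2018-08-15 00:00:49 122.45.#67.9",
  "a#b#c"
)

result <- strip_inner_hashes(input)
print(result)
```

In a real script the lines would more likely come from readLines() on the file, and the cleaned result could then be written back with writeLines() or passed to read.table(text = ...).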
Upvotes: 1