12341234

Reputation: 404

R: Removing Whitespace + Delimiter

I'm fairly new to the R language. So I have this vector containing the following:

> head(sampleVector)

[1] "| txt01 |   100 |         200 |       123.456 |           0.12345 |"
[2] "| txt02 |   300 |         400 |       789.012 |           0.06789 |"

I want to split each line into separate pieces, with one data value per piece. The goal is a list, resultList, that prints as follows:

> head(resultList)

[[1]]
[1] ""   "txt01"    "100"       "200"     "123.456"        "0.12345"

[[2]]
[1] ""   "txt02"    "300"       "400"     "789.012"        "0.06789"

I am struggling with the strsplit() notation. This is the code I have so far:

resultList  <- strsplit(sampleVector,"\\s+[|] | [|]\\s+ | [\\s+]")
# would give me the following output

# [[1]]
# [1] "| txt01"    "100"       "200"     "123.456"        "0.12345 |"

Is there any way I can get this output with a single strsplit call? I am guessing my notation for matching the delimiter plus whitespace is wrong. Any help would be appreciated.

Upvotes: 4

Views: 2669

Answers (3)

thelatemail

Reputation: 93813

Another strsplit option which I nearly missed:

strsplit(test,"[| ]+")
#[[1]]
#[1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
# 
#[[2]]
#[1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"

...and my original answer because regmatches is my favourite function of late:

regmatches(test,gregexpr("[^| ]+",test))
#[[1]]
#[1] "txt01"   "100"     "200"     "123.456" "0.12345"
#
#[[2]]
#[1] "txt02"   "300"     "400"     "789.012" "0.06789"

To break it down as requested:

[| ]+ is a regex that matches one or more (+) consecutive characters, each of which is a space or a pipe |
[^| ]+ is a regex that matches one or more (+) consecutive characters, each of which is not (^) a space or a pipe |
gregexpr finds every match of this pattern and returns the start positions and lengths of the matches.
regmatches extracts from test the substrings that gregexpr located.
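
As a quick check of the breakdown above, here is a minimal sketch. It assumes test holds the same two strings as the question's sampleVector, which the answer does not state explicitly. The names split_res and match_res are made up for illustration.

```r
# Assumption: `test` holds the same data as the question's sampleVector.
test <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# Split on runs of pipes and/or spaces. The leading "| " is itself a
# delimiter, so each result starts with an empty string.
split_res <- strsplit(test, "[| ]+")

# Extract runs of characters that are neither a pipe nor a space.
# Nothing is split here, so no empty string appears.
match_res <- regmatches(test, gregexpr("[^| ]+", test))

# Dropping the leading "" from the strsplit result should leave the same
# values that regmatches returns.
dropped <- lapply(split_res, function(x) x[-1])
all(unlist(dropped) == unlist(match_res))
```

If the leading empty string is unwanted, the regmatches form avoids it without a separate clean-up step.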

Upvotes: 4

Rich Scriven

Reputation: 99331

Here's one way. This first removes every | from the vector with gsub, then uses strsplit to split on runs of one or more whitespace characters. Probably a bit easier that way.

strsplit(gsub("|", "", sampleVector, fixed=TRUE), "\\s+")
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
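
A minimal sketch of the same two steps, run separately so the intermediate result is visible. The names no_pipes and non_empty are illustrative, and the final filtering step is an optional extra that is not part of the answer above.

```r
sampleVector <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# Step 1: delete every pipe. fixed = TRUE treats "|" as a literal character
# rather than as the regex alternation operator.
no_pipes <- gsub("|", "", sampleVector, fixed = TRUE)

# Step 2: split on runs of whitespace. Each string still starts with a
# space, so the first piece of every result is "".
resultList <- strsplit(no_pipes, "\\s+")

# Optional (not in the answer): drop the empty strings.
non_empty <- lapply(resultList, function(x) x[nzchar(x)])
```

Keeping or dropping the leading "" depends on whether the output must match the question's expected result exactly.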

Here's an interesting alternative using scan that might be useful, and will probably be quite fast.

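lapply(sampleVector, function(y) {
    # Split on "|"; the fields still carry their surrounding whitespace
    s <- scan(text = y, what = character(), sep = "|", quiet = TRUE)
    # Strip the whitespace, then drop the last field, which comes from the trailing "|"
    g <- gsub("\\s+", "", s)
    g[-length(g)]
})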
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"

Upvotes: 4

rnso

Reputation: 24545

You could also try strsplit first and then gsub:

sapply(strsplit(xx, '\\|'), function (x) gsub("^\\s+|\\s+$", "", x))
     [,1]     
[1,] ""       
[2,] "txt01"  
[3,] "100"    
[4,] "200"    
[5,] "123.456"
[6,] "0.12345"
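
The output above shows a single input string. For several strings, a variant along these lines keeps the result as a list, as the question asks, whereas sapply simplifies equal-length results into a matrix. This is only a sketch: it assumes xx holds the question's data, and it swaps the gsub call for trimws, which is available in base R from version 3.2.0.

```r
# Assumption: `xx` holds the same data as the question's sampleVector.
xx <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# Split on the literal pipe; fixed = TRUE avoids having to escape it.
pieces <- strsplit(xx, "|", fixed = TRUE)

# Trim leading and trailing whitespace from every piece.
# lapply returns a list with one character vector per input string.
resultList <- lapply(pieces, trimws)
```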

Upvotes: 0
