nielsvg

Reputation: 25

Only select last part of a string after the last point

I have a dataframe with one column that represents the requests made by my users. A few examples look like this:

GET /enviro/html/tris/tris_overview.html
GET /./enviro/gif/emcilogo.gif
GET /docs/exposure/meta_exp.txt.html
GET /hrmd/
GET /icons/circle_logo_small.gif

I want to extract only the last part of each string, from the last "." onward, so that I get the page type of the request. The output for these lines should therefore be:

.html
.gif
.html

.gif

I tried doing this with sub, but I only manage to select everything after the first ".". For example:

tring <- c("GET /enviro/html/tris/tris_overview.html", "GET /./enviro/gif/emcilogo.gif", "GET /docs/exposure/meta_exp.txt.html", "GET /hrmd/", "GET /icons/circle_logo_small.gif")


sub("^[^.]*", "", sapply(strsplit(tring, "\\s+"), `[`, 2))

this returns:

".html"                     
"./enviro/gif/emcilogo.gif" 
".txt.html"                 
""                          
".gif"  

I created the following gsub code that works for a string containing two dots:

gsub(pattern = ".*\\.", replacement = "", "GET /./enviro/gif/finds.gif")

this returns:

"gif"

However, I can't seem to come up with a single gsub/sub call that works for all possible inputs. It should read the string from right to left, stop at the first "." it sees, and return everything found from that "." onward.

I am new to R and I can't come up with something that is doing this. Any help would be highly appreciated!

Upvotes: 1

Views: 105

Answers (2)

s_baldur

Reputation: 33488

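Here is a regex-free solution (a is the character vector of requests; note that the result omits the leading dot):

sapply(
  seq_along(a),
  function(i) {
    # fixed = TRUE treats "." as a literal character, so no regex is involved
    if (grepl(".", a[i], fixed = TRUE)) tail(strsplit(a[i], ".", fixed = TRUE)[[1]], 1) else ""
  }
)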

# [1] "html" "gif"  "html" ""     "gif" 
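If the leading dot is not required, base R's tools::file_ext() may be a shorter route to the same result. The sketch below assumes the requests end in a plain alphanumeric extension (no query strings or trailing characters), and the variable names requests, ext and with_dot are only illustrative:

```r
# Illustrative input: the example requests from the question
requests <- c(
  "GET /enviro/html/tris/tris_overview.html",
  "GET /./enviro/gif/emcilogo.gif",
  "GET /docs/exposure/meta_exp.txt.html",
  "GET /hrmd/",
  "GET /icons/circle_logo_small.gif"
)

# file_ext() returns the alphanumeric text after the last ".", or "" if none
ext <- tools::file_ext(requests)

# Put the dot back only where an extension was found
with_dot <- ifelse(nzchar(ext), paste0(".", ext), "")

print(ext)
print(with_dot)
```

Since file_ext() only accepts alphanumeric extensions at the very end of the string, it is stricter than splitting on the last dot, which may or may not be desirable for real log data.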

Upvotes: 0

Wiktor Stribiżew

Reputation: 626738

You can't change the string parsing direction with R regex. Instead, you may either match everything up to the last . and remove it, or match the . that has no other . chars to its right up to the end of the string.

string <- c('GET /enviro/html/tris/tris_overview.html','GET /./enviro/gif/emcilogo.gif','GET /docs/exposure/meta_exp.txt.html','GET /hrmd/','GET /icons/circle_logo_small.gif')
res <- regmatches(string, regexec("\\.[^.]*$", string))
res[lengths(res)==0] <- ""
unlist(res)

Or

sub("^(.*(?=\\.)|.*)", "", string, perl=TRUE)

Both return

[1] ".html" ".gif"  ".html" ""      ".gif"

Here, \.[^.]*$ matches a . and then any 0+ chars other than . up to the end of the string. The sub code uses the ^(.*(?=\.)|.*) pattern, which matches the start of the string and then either any 0+ chars, as many as possible, up to the last . without consuming that dot, or, when the string contains no dot, the whole string. The match is then replaced with an empty string.
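Since the question mentions a dataframe column, the first pattern could be wrapped in a small helper and applied to that column. This is only a sketch: the helper name last_ext, the data frame requests and its columns request and pagetype are assumptions made for illustration, and it assumes the column contains no NA values.

```r
# Hypothetical helper: text from the last "." onward, or "" when there is no "."
last_ext <- function(x) {
  m <- regexpr("\\.[^.]*$", x)      # start of the match, -1 where nothing matched
  out <- rep("", length(x))
  out[m > 0] <- regmatches(x, m)    # regmatches() returns only the matched elements
  out
}

# Illustrative data frame built from the example requests
requests <- data.frame(
  request = c(
    "GET /enviro/html/tris/tris_overview.html",
    "GET /./enviro/gif/emcilogo.gif",
    "GET /docs/exposure/meta_exp.txt.html",
    "GET /hrmd/",
    "GET /icons/circle_logo_small.gif"
  ),
  stringsAsFactors = FALSE
)

requests$pagetype <- last_ext(requests$request)
print(requests$pagetype)
```

Using regexpr() instead of regexec() keeps the result a plain character vector, so no unlist() step or empty-element patching is needed.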


Upvotes: 2
