Regular expressions, stringr - have the regex, can't get it to work in R

Question

I have a data frame with a field called "full.path.name". It contains values like s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx

The 01 GROUP part varies in length from row to row.

I would like to add a new field to the data frame called "short.path" that contains values like

s:///01 GROUP

s:///02 GROUP LONGER NAME

I've managed to extract the last four characters of the file name using stringr, so I think I should use stringr again.

This gives me the file extension:

sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))

I went to https://www.regextester.com/ and got this

 s:///*.[^/]*

as the regex to use, so I tried it below:

sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))

What I expected was a new field on my data frame containing 01 GROUP etc. Instead I get NA.

When I try this

sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")

it gives me S.

Where am I going wrong? When I use https://regexr.com/ I get \d* [A-Z]* [A-Z]*[^/]

How do I put that into

sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^/])

And make things work?

EDIT: There are two solutions here. They didn't work at first because

  sfiles$Full.path.name

was longer than 255 characters in some cases.

What I did to make g_t_m's regex work:

 library(tidyverse)
 # read the file
 sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = FALSE)

 # add a field holding the path length and filter out the long paths
 sfiles$file_path_length <- str_length(sfiles$Full.path.name)
 sfiles <- sfiles %>% filter(file_path_length <= 255)

 # then use str_replace to take out the full path name and leave only the
 # top folder names ("\\1" refers to the first capture group)
 sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name,
   "(^.+?/[^/]+?)/.+$", "\\1"))
 levels(sfiles$file_path_short)

[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7"
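
As a quick check of the pattern on its own, here is a minimal sketch using made-up paths (the `paths` vector is hypothetical, not my real data). It shows the capture group keeping everything up to the top-level folder; the back-reference has to be written as "\\1" in R source so that the regex engine receives \1.

```r
library(stringr)

# Hypothetical sample paths, shaped like the ones in the data frame
paths <- c(
  "S:///01 GROUP 1/01 SUBGROUP/file one.docx",
  "S:///02 GROUP 2/budget.xlsx"
)

# Group 1 lazily captures the prefix plus the first folder name;
# the rest of the path is matched and then dropped by the replacement
short <- str_replace(paths, "(^.+?/[^/]+?)/.+$", "\\1")
short
```

A path with nothing after the top-level folder does not match the pattern, so str_replace returns it unchanged.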

I think it was the full.path.name field that was causing problems. To make Wiktor's answer work I did this:

#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)       
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name,
  "(^.+?/[^/]+?)/.+$", "\\1")

Wiktor Stribiżew · Accepted Answer

You may use a mere

sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")

If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:

"(?<=^s:///)[^/]+"

Details

^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
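
The two patterns above can be tried on a few made-up paths. This is only a sketch: the `paths` vector is hypothetical, and the third entry deliberately starts with an upper-case "S" (as in the levels shown in the question's edit) to illustrate that the match is case-sensitive unless the pattern is wrapped in `regex(..., ignore_case = TRUE)`.

```r
library(stringr)

# Hypothetical sample paths, modelled on the examples in the question
paths <- c(
  "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
  "s:///02 GROUP LONGER NAME/report.docx",
  "S:///03 GROUP 3/notes.txt"
)

# Prefix plus top-level folder; the upper-case "S:///" path yields NA
with_prefix <- str_extract(paths, "^s:///[^/]+")

# Top-level folder only, using the lookbehind
folder_only <- str_extract(paths, "(?<=^s:///)[^/]+")

# Case-insensitive variant that also matches "S:///"
any_case <- str_extract(paths, regex("^s:///[^/]+", ignore_case = TRUE))
```

Patterns copied from an online tester that contain backslashes, such as `\d`, need the backslash doubled inside an R string (`"\\d"`), and the whole pattern must be passed as a quoted string.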
