damo

Reputation: 473

Regular expressions, stringr - have the regex, can't get it to work in R

I have a data frame with a field called "full.path.name". It contains values like s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx

01 GROUP is a pattern of variable size in the whole string.

I would like to add a new field onto the data frame called "short.path" and it would contain things like

s:///01 GROUP

s:///02 GROUP LONGER NAME

I've managed to extract the last four characters of the file name using stringr, so I think I should use stringr again.

This gives me the file extension

sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))

I went to https://www.regextester.com/ and got this

 s:///*.[^/]*

as the regex to use so I tried it below

sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))

What I thought I would get is a new field on my data frame containing 01 GROUP etc. Instead I get NA.

When I try this

sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")

Gives me S

Where am I going wrong? When I use: https://regexr.com/ I get \d* [A-Z]* [A-Z]*[^/]

How do I put that into

sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])

And make things work?

EDIT: There are two solutions here. The reason the solutions didn't work at first was because

  sfiles$Full.path.name 

was longer than 255 characters in some cases.

What I did: To make g_t_m's regex work

 library(tidyverse)

 # read the file
 sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = FALSE)

 # add a field holding the path length and drop paths longer than 255 characters
 sfiles$file_path_length <- str_length(sfiles$Full.path.name)
 sfiles <- sfiles %>% filter(file_path_length <= 255)

 # then use str_replace to strip the rest of the path and keep only the
 # top folder names
 sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1"))
 levels(sfiles$file_path_short)

[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7"

I think it was the full.path.name field that was causing problems. To make Wiktor's answer work I did this:

# read the file
sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = FALSE)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles <- sfiles %>% filter(file_path_length <= 255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")

Upvotes: 1

Views: 239

Answers (2)

Wiktor Stribiżew

Reputation: 626845

You may use a mere

sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")

If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:

"(?<=^s:///)[^/]+"


Details

  • ^ - start of string
  • s:/// - a literal substring
  • [^/]+ - a negated character class matching any 1+ chars other than /.
  • (?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
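
To see both patterns side by side, here is a minimal sketch on made-up paths (the `paths` vector and the variable names are illustrative, not taken from the asker's file). The levels printed in the question's edit start with an uppercase `S:///`, so, assuming the real data may mix cases, the last call wraps the pattern in stringr's `regex()` with `ignore_case = TRUE`.

```r
library(stringr)

# Hypothetical sample paths; the third one uses an uppercase drive letter
paths <- c(
  "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
  "s:///02 GROUP LONGER NAME/report.pdf",
  "S:///03 GROUP 3/notes.txt"
)

# Keep the s:/// prefix in the result
with_prefix <- str_extract(paths, "^s:///[^/]+")

# Drop the prefix by moving it into a lookbehind
without_prefix <- str_extract(paths, "(?<=^s:///)[^/]+")

# Case-insensitive variant, so "S:///" is matched as well
any_case <- str_extract(paths, regex("^s:///[^/]+", ignore_case = TRUE))

print(with_prefix)
print(without_prefix)
print(any_case)
```

With the case-sensitive patterns the third path comes back as `NA`, which is one possible reason for getting `NA` on real data.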

Upvotes: 1

g_t_m

Reputation: 714

Firstly, I would change how you extract the file extension, since file extensions are not always 4 characters long:

library(stringr)

df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
                                    "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = FALSE)

df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")

df$file_type
[1] "docx" "pdf" 

Then, the following code should give you your short name:

df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")

df
                                              full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx      docx   s:///01 GROUP
2  s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf       pdf   s:///01 GROUP
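
One caveat with the `str_replace` approach for the extension: when the pattern does not match, `str_replace` returns the input unchanged, so a file with no extension would get its whole name as `file_type`. As a possible alternative, base R's `tools::file_ext()` returns an empty string in that case. The sketch below compares the two on made-up paths (the `README` path is a hypothetical example, not from the question).

```r
library(stringr)

# Hypothetical sample paths; the last file has no extension
paths <- c(
  "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
  "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf",
  "s:///01 GROUP/01 SUBGROUP/README"
)

# Regex approach from above: no match means the basename is returned as-is
ext_regex <- str_replace(basename(paths), "^.+\\.(.+)$", "\\1")

# Base R helper: gives "" when there is no extension
ext_base <- tools::file_ext(paths)

print(ext_regex)
print(ext_base)
```

Which behaviour is preferable depends on how you want to treat extension-less files later, for example when converting `file_type` to a factor.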

Upvotes: 1
