Reputation: 473
I have a data frame with a field called "full.path.name". It contains values like s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx
The 01 GROUP part varies in length from one row to the next.
I would like to add a new field to the data frame called "short.path", containing values like
s:///01 GROUP
s:///02 GROUP LONGER NAME
I've managed to extract the last four characters of the file name using stringr, so I think I should use stringr again.
This gives me the file extension
sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))
I went to https://www.regextester.com/ and got this
s:///*.[^/]*
as the regex to use so I tried it below
sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))
I expected a new field on my data frame containing 01 GROUP etc., but I get NA.
When I try this
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")
Gives me S
Where am I going wrong? When I use: https://regexr.com/ I get \d* [A-Z]* [A-Z]*[^/]
How do I put that into
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])
And make things work?
EDIT: There are two solutions here. The reason they didn't work at first was that
sfiles$Full.path.name
was longer than 255 characters in some cases.
What I did to make g_t_m's regex work:
library(tidyverse)
# read the file
sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = FALSE)
# add a field with the path length and drop paths longer than 255 characters
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles <- sfiles %>% filter(file_path_length <= 255)
# then use str_replace to strip the rest of the full path name and keep only the
# top folder names
sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1"))
levels(sfiles$file_path_short)
[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7"
I think it was the full.path.name field that was causing problems. To make Wiktor's answer work I did this:
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
Upvotes: 1
Views: 239
Reputation: 626845
You may use a mere
sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")
If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:
"(?<=^s:///)[^/]+"
See the regex demo
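As a quick check, here is a minimal sketch of both patterns on made-up paths (the `paths` vector and the variable names are hypothetical, not taken from the question's data). The levels printed in the question start with an uppercase `S`, so a case-sensitive `^s:///` may be one source of `NA`. The last call assumes mixed case in the real data and ignores case via stringr's `regex()` helper.

```r
library(stringr)

# Hypothetical sample paths; the second one uses an uppercase drive letter
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "S:///02 GROUP LONGER NAME/report.pdf")

# Case-sensitive pattern: the uppercase "S:///" path does not match, giving NA
with_prefix <- str_extract(paths, "^s:///[^/]+")

# Lookbehind variant: same match, but the "s:///" prefix is left out of the result
without_prefix <- str_extract(paths, "(?<=^s:///)[^/]+")

# Assumption: the data mixes "s" and "S", so match the prefix case-insensitively
any_case <- str_extract(paths, regex("^s:///[^/]+", ignore_case = TRUE))

print(with_prefix)
print(without_prefix)
print(any_case)
```

If every row is expected to match, counting `sum(is.na(any_case))` afterwards is a cheap way to spot paths that do not follow the assumed layout.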
Details
^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
Upvotes: 1
Reputation: 714
Firstly, I would amend your regex to extract the file extension, since file extensions are not always 4 characters long:
library(stringr)
df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
"s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = F)
df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")
df$file_type
[1] "docx" "pdf"
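As an alternative to a hand-written regex, base R ships `tools::file_ext()`, which returns the extension without the dot. A minimal sketch on the same two sample paths (the `paths` and `ext` names are only for illustration):

```r
# Same sample file names as in the example above
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf")

# file_ext() returns everything after the last dot, or "" if there is no extension
ext <- tools::file_ext(paths)
print(ext)
```

Unlike a fixed `str_sub(x, -4)`, this does not depend on the extension having a particular length.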
Then, the following code should give you your short name:
df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
df
                                              full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx      docx   s:///01 GROUP
2  s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf       pdf   s:///01 GROUP
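To see how the capture-group replacement behaves on paths of different depth, here is a small sketch with made-up inputs (the `paths` vector is hypothetical). Note that `str_replace` returns a string unchanged when the pattern does not match, whereas `str_extract` would return `NA`.

```r
library(stringr)

# Hypothetical paths of varying depth
paths <- c("s:///01 GROUP/01 SUBGROUP/a.docx",  # file in a subfolder
           "s:///02 GROUP LONGER NAME/b.pdf",   # file directly in the top folder
           "s:///03 GROUP")                     # nothing after the top folder

# Keep only the capture group: everything up to the first slash that follows
# a run of non-slash characters
short <- str_replace(paths, "(^.+?/[^/]+?)/.+$", "\\1")
print(short)
```

The third path has no trailing component, so the pattern does not match and the value is passed through as it is.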
Upvotes: 1