Reputation: 1979
I have a regex question.
I have a list of files below.
df <- c("Alilis CELF-4_CF_Data_Entry.xlsx" , "Ana T. CELF-4_CF_Data_Entry.xlsx" , "Ana V. CELF-4_CF_Data_Entry.xlsx","Anita CELF-4_CF_Data_Entry.xlsx")
[1] "Alilis CELF-4_CF_Data_Entry.xlsx" "Ana T. CELF-4_CF_Data_Entry.xlsx" "Ana V. CELF-4_CF_Data_Entry.xlsx" "Anita CELF-4_CF_Data_Entry.xlsx"
I need to extract the name at the beginning of each string, but some names include an initial followed by a dot (e.g. Ana V.), and I was not able to extract those initials.
With the code below,
unique(word(df, 1))
[1] "Alilis" "Ana" "Anita"
How can I get the following?
[1] "Alilis" "Ana T." "Ana V." "Anita"
Upvotes: 2
Views: 88
Reputation: 101247
We can try sub with the pattern (\\S+(\\s\\w\\.)?).*, i.e.,
> sub("(\\S+(\\s\\w\\.)?).*", "\\1", df)
[1] "Alilis" "Ana T." "Ana V." "Anita"
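To make the pieces of this pattern easier to follow, here is a minimal annotated sketch in base R. The variable names files, name_pattern and first_names are illustrative and not part of the original answer.

```r
# Sample file names from the question
files <- c("Alilis CELF-4_CF_Data_Entry.xlsx",
           "Ana T. CELF-4_CF_Data_Entry.xlsx",
           "Ana V. CELF-4_CF_Data_Entry.xlsx",
           "Anita CELF-4_CF_Data_Entry.xlsx")

# \\S+         : the first run of non-whitespace characters (the first name)
# (\\s\\w\\.)? : optionally, one whitespace, one word character and a literal dot
# .*           : the rest of the string, which is dropped by the replacement
name_pattern <- "(\\S+(\\s\\w\\.)?).*"

# "\\1" keeps only the outer capture group (name plus optional initial)
first_names <- sub(name_pattern, "\\1", files)
print(first_names)
```

The optional group only matches a single character followed by a dot, so this sketch assumes initials are always written that way (for example "T." and not "T" or "Th.").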
Upvotes: 2
Reputation: 21400
You can also use extract with a lazy capture group:
library(tidyr)
data.frame(df) %>%
  extract(df,
          into = "Name",
          regex = "(.*?)\\sCELF")
Name
1 Alilis
2 Ana T.
3 Ana V.
4 Anita
The regex part (.*?) captures anything that comes before ...
\\s : one whitespace character, followed by ...
CELF : ... the literal string CELF
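The same idea (keep everything before the whitespace that precedes CELF) can also be sketched in base R without tidyr. This is an alternative, not the answer's method, and it assumes every file name contains the literal text " CELF"; the names files and names_only are illustrative.

```r
# Sample file names from the question
files <- c("Alilis CELF-4_CF_Data_Entry.xlsx",
           "Ana T. CELF-4_CF_Data_Entry.xlsx",
           "Ana V. CELF-4_CF_Data_Entry.xlsx",
           "Anita CELF-4_CF_Data_Entry.xlsx")

# Delete one whitespace character, the literal "CELF" and everything after it,
# which leaves only the part of the string that precedes " CELF"
names_only <- sub("\\sCELF.*$", "", files)
print(names_only)
```

Strings that do not contain " CELF" would be returned unchanged by this sketch, so it may be worth checking for them separately.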
Upvotes: 2
Reputation: 41235
Another option, assuming you want everything before ' CELF':
df <- c("Alilis CELF-4_CF_Data_Entry.xlsx" , "Ana T. CELF-4_CF_Data_Entry.xlsx" , "Ana V. CELF-4_CF_Data_Entry.xlsx","Anita CELF-4_CF_Data_Entry.xlsx")
library(stringr)
word(df,1,sep = "\\ CELF")
#> [1] "Alilis" "Ana T." "Ana V." "Anita"
Created on 2022-09-23 with reprex v2.0.2
Upvotes: 3
Reputation: 887048
Try with
gsub("^((\\S+)|^(\\w+ [A-Z]\\.))\\s+.*", "\\1", df)
[1] "Alilis" "Ana T." "Ana V." "Anita"
This should also work if there are multiple spaces:
> gsub("^((\\S+)|^(\\w+ [A-Z]\\.))\\s+.*", "\\1", c(df, "Allis hello CELF-4_Data_Entry.xlsx"))
[1] "Alilis" "Ana T." "Ana V." "Anita" "Allis"
Upvotes: 4
Reputation: 6489
Try the following code:
sub("\\s+\\S+$", "", df)
[1] "Alilis" "Ana T." "Ana V." "Anita"
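As a hedged illustration of what this pattern does: it removes the last whitespace-separated token, so it assumes the file-name part after the name contains no spaces. The sketch below adds the extra entry used in another answer to show that everything before the last token, including a second word, is kept. The names files and stripped are illustrative.

```r
# File names from the question plus one entry with an extra word
files <- c("Alilis CELF-4_CF_Data_Entry.xlsx",
           "Ana T. CELF-4_CF_Data_Entry.xlsx",
           "Ana V. CELF-4_CF_Data_Entry.xlsx",
           "Anita CELF-4_CF_Data_Entry.xlsx",
           "Allis hello CELF-4_Data_Entry.xlsx")

# \\s+ : one or more whitespace characters
# \\S+ : the final run of non-whitespace characters
# $    : anchored at the end of the string
stripped <- sub("\\s+\\S+$", "", files)
print(stripped)
```

For the last entry this keeps "Allis hello", which differs from approaches that return only the first word, so the right choice depends on how multi-word names should be treated.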
Upvotes: 3