amisos55

Reputation: 1979

How to extract first string in R

I have a regex question.

I have the following vector of file names.

df <- c("Alilis CELF-4_CF_Data_Entry.xlsx" , "Ana T. CELF-4_CF_Data_Entry.xlsx" , "Ana V. CELF-4_CF_Data_Entry.xlsx","Anita CELF-4_CF_Data_Entry.xlsx")

[1] "Alilis CELF-4_CF_Data_Entry.xlsx" "Ana T. CELF-4_CF_Data_Entry.xlsx" "Ana V. CELF-4_CF_Data_Entry.xlsx" "Anita CELF-4_CF_Data_Entry.xlsx" 

I need to extract the name at the beginning of each string, but some names include an initial followed by a dot (e.g. Ana V.), and I was not able to extract those initials.

With the code below, I only get the first word:

unique(word(df, 1))
[1] "Alilis" "Ana"    "Anita" 

How can I get the following?

[1] "Alilis" "Ana T."  "Ana V."  "Anita"

Upvotes: 2

Views: 88

Answers (5)

ThomasIsCoding

Reputation: 101247

We can try sub with pattern (\\S+(\\s\\w\\.)?).*, i.e.,

> sub("(\\S+(\\s\\w\\.)?).*", "\\1", df)
[1] "Alilis" "Ana T." "Ana V." "Anita"
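As a rough breakdown of how this pattern works, the sketch below annotates each piece and reruns it on the vector from the question. The names pattern and first_names are only illustrative.

```r
# Same vector as in the question
df <- c("Alilis CELF-4_CF_Data_Entry.xlsx", "Ana T. CELF-4_CF_Data_Entry.xlsx",
        "Ana V. CELF-4_CF_Data_Entry.xlsx", "Anita CELF-4_CF_Data_Entry.xlsx")

# \\S+          first run of non-whitespace characters (the first name)
# (\\s\\w\\.)?  optionally: one whitespace, one word character, a literal dot
# .*            the rest of the string, which is discarded
pattern <- "(\\S+(\\s\\w\\.)?).*"

# "\\1" keeps only what the outer capture group matched
first_names <- sub(pattern, "\\1", df)
print(first_names)
```

Note that this assumes the initial is a single character followed by a dot; a longer middle name would not be captured.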

Upvotes: 2

Chris Ruehlemann

Reputation: 21400

You can also use extract with a lazy capture group:

library(tidyr)
data.frame(df) %>%
  extract(df,
          into = "Name",
          regex = "(.*?)\\sCELF")
    Name
1 Alilis
2 Ana T.
3 Ana V.
4  Anita

The regex captures everything that comes before ...

  • \\s: one whitespace character followed by ...
  • CELF: ... the literal string CELF
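For comparison, here is a minimal base R sketch of the same idea written with an explicit positive lookahead instead of a capture group. It assumes every string contains " CELF"; the variable names m and names_only are illustrative.

```r
# Same vector as in the question
df <- c("Alilis CELF-4_CF_Data_Entry.xlsx", "Ana T. CELF-4_CF_Data_Entry.xlsx",
        "Ana V. CELF-4_CF_Data_Entry.xlsx", "Anita CELF-4_CF_Data_Entry.xlsx")

# ^.*?         lazily match from the start of the string ...
# (?=\\sCELF)  ... up to the point where whitespace plus "CELF" follows;
#              the lookahead itself is not part of the match.
# perl = TRUE is needed because lookarounds are a PCRE feature.
m <- regexpr("^.*?(?=\\sCELF)", df, perl = TRUE)

# regmatches() returns the matched substrings
names_only <- regmatches(df, m)
print(names_only)
```

One caveat: regmatches() silently drops elements that do not match, so the result can be shorter than the input if some file names lack " CELF".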

Upvotes: 2

Quinten

Reputation: 41235

Another option, assuming you want everything before ' CELF':

df <- c("Alilis CELF-4_CF_Data_Entry.xlsx" , "Ana T. CELF-4_CF_Data_Entry.xlsx" , "Ana V. CELF-4_CF_Data_Entry.xlsx","Anita CELF-4_CF_Data_Entry.xlsx")

library(stringr)
word(df,1,sep = "\\ CELF")
#> [1] "Alilis" "Ana T." "Ana V." "Anita"

Created on 2022-09-23 with reprex v2.0.2
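If you would rather avoid the stringr dependency, a base R sketch of the same "split on ' CELF' and keep the first piece" idea could look like this. The name before_celf is illustrative.

```r
# Same vector as in the question
df <- c("Alilis CELF-4_CF_Data_Entry.xlsx", "Ana T. CELF-4_CF_Data_Entry.xlsx",
        "Ana V. CELF-4_CF_Data_Entry.xlsx", "Anita CELF-4_CF_Data_Entry.xlsx")

# Split each string on the literal text " CELF" (fixed = TRUE disables regex)
pieces <- strsplit(df, " CELF", fixed = TRUE)

# Keep the first piece of each split; vapply guarantees a character vector
before_celf <- vapply(pieces, function(parts) parts[1], character(1))
print(before_celf)
```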

Upvotes: 3

akrun

Reputation: 887048

Try with

gsub("^((\\S+)|^(\\w+ [A-Z]\\.))\\s+.*", "\\1", df)
[1] "Alilis" "Ana T." "Ana V." "Anita" 

This should also work if the string contains additional space-separated words:

> gsub("^((\\S+)|^(\\w+ [A-Z]\\.))\\s+.*", "\\1", c(df, "Allis hello CELF-4_Data_Entry.xlsx"))
[1] "Alilis" "Ana T." "Ana V." "Anita"  "Allis" 

Upvotes: 4

B. Christian Kamgang

Reputation: 6489

Try the following code:

sub("\\s+\\S+$", "", df)

[1] "Alilis" "Ana T." "Ana V." "Anita"
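This pattern removes the last whitespace-separated token from each string, so it relies on the file-name part containing no spaces. The sketch below shows both cases; the second input is a hypothetical file name made up for illustration and is not from the question.

```r
# Same vector as in the question
df <- c("Alilis CELF-4_CF_Data_Entry.xlsx", "Ana T. CELF-4_CF_Data_Entry.xlsx",
        "Ana V. CELF-4_CF_Data_Entry.xlsx", "Anita CELF-4_CF_Data_Entry.xlsx")

# \\s+\\S+$ matches the final run of whitespace plus the final
# run of non-whitespace characters, i.e. the last "word"
stripped <- sub("\\s+\\S+$", "", df)
print(stripped)

# Hypothetical input with a space inside the file-name part:
# only the last token is removed, so "CELF-4" is left behind
with_space <- sub("\\s+\\S+$", "", "Ana T. CELF-4 CF_Data_Entry.xlsx")
print(with_space)
```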

Upvotes: 3
