Mahdi Hadi

Reputation: 402

How to extract specific words from a string with pattern in R

I have a data frame that contains the names of the supervisors and advisors of students' dissertations in a faculty, for example:

 DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))

I want to separate supervisors and advisors into two distinct columns, like this:

DF1<-data.frame(Supervisor=c("Ali Ahmadi","Ali Ahmadi","Ali Ahmadi"),Advisors=c("Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi"))

DF1
  Supervisor                                             Advisors
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

I tried the following code:

DF1<-strsplit(DF$Names, "Name :")

stopwords = c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")

DF2 <- lapply(DF1,function(x) unlist(strsplit(x," ")) )

DF3 <- lapply(DF2,function(x)  x[!x %in% stopwords] )

DF4<-lapply(DF3,function(x)  paste(x, collapse = " "))

But the final result, shown below, is not what I expected and apparently needs further work to be converted to a data frame:

DF4
[[1]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

[[2]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

[[3]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

Is there a simpler way to solve this problem? I found that regular expressions can be helpful, but I don't know how to use them, at least for my example. Thanks in advance for any answer.

Upvotes: 1

Views: 96

Answers (3)

hello_friend

Reputation: 5788

Messy Base R:

# Store a vector of names: ir_names => character vector
ir_names <- c("Name", "Family", "Type")

# Compute its length: ir_name_len => integer scalar
ir_name_len <- length(ir_names)

# Compute the desired result: res => data.frame
res <- do.call(
  rbind, 
  lapply(
    strsplit(
      DF$Names,
      "Name\\s+\\:\\s+"
    ),
    function(x){
      y <- data.frame(tmp = unlist(strsplit(x, " , ")))
      ir1 <- setNames(
        data.frame(
          do.call(
            rbind, 
            lapply(
              split(
                y, 
                ceiling(seq_len(nrow(y))/ir_name_len)
              ), 
              t
            )
          ),
          row.names = NULL,
          stringsAsFactors = FALSE
        ),
        ir_names
      )
      ir2 <- transform(
        ir1,
        Name = trimws(paste(Name, gsub("Family\\s+\\:\\s+", "", Family))),
        Type = trimws(gsub("Type\\s+\\:\\s+", "", Type))
      )[,c("Name", "Type")]
      ir3 <- data.frame(
        Supervisor = ir2$Name[which(grepl("supervisor", ir2$Type))],
        Advisor = toString(ir2$Name[-which(grepl("supervisor", ir2$Type))]),
        stringsAsFactors = FALSE,
        row.names = NULL
      )
    }
  )
)
# Print to console: data.frame => stdout(console)
res

Upvotes: 1

Chris Ruehlemann

Reputation: 21400

Here's an attempt with extract:

library(dplyr)  # provides mutate() and the %>% pipe used below
library(tidyr)  # provides extract()
DF %>%
  # clean strings:
  mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names, perl = TRUE)) %>%
  # extract data into columns:
  extract(Names,
          into = c("Supervisor", "Advisor"),
          regex = "(\\w+\\s\\w+)\\s(.*)") %>%
  # insert commas into `Advisor`:
  mutate(Advisor = gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", Advisor, perl = TRUE))
  Supervisor                                              Advisor
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

Explanation (as requested by OP):

The regular expression passed to extract's regex argument is designed to do two tasks:

  • (i) it must describe the string as a whole, from beginning to end
  • (ii) it must pick out those elements that should populate the newly created columns

Task (i) is achieved as follows: (\\w+\\s\\w+) matches the two words that make up the Supervisor name, \\s matches (but does not capture) the following whitespace, and (.*) matches anything that follows that whitespace, i.e., in this case the four Advisor names.

Task (ii) is achieved by wrapping the Supervisor name and the Advisor names in capturing groups, written as parentheses; these parentheses are the 'syntax' by which the function extract 'realizes' that their content should go into the new columns.

Finally, the commas are inserted between the Advisor names, again using a capturing group, which is recalled in gsub's replacement argument with a backreference (\\1). The (?!$) expression is a negative lookahead asserting that the comma is inserted only if what follows the word boundary anchor \\b is not (hence the ! in the lookahead) the end of the string (expressed by $). Hope this helps!
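To see the two regex steps in isolation, here is a minimal base R sketch. It assumes the string has already been cleaned by the first gsub call, and it uses regexec/regmatches in place of extract purely for illustration; the variable names (x, m, supervisor, advisors_raw, advisors) are made up for this example.

```r
# Assumed input: one cleaned string, as left behind by the first gsub() step
x <- "Ali Ahmadi Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# Capturing groups: regexec() returns the full match followed by one
# entry per parenthesised group
m <- regmatches(x, regexec("(\\w+\\s\\w+)\\s(.*)", x))[[1]]
supervisor   <- m[2]  # first group: the first two words
advisors_raw <- m[3]  # second group: everything after the next whitespace

# Negative lookahead: append a comma to every two-word name unless the
# name is immediately followed by the end of the string
advisors <- gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", advisors_raw, perl = TRUE)

supervisor
#> [1] "Ali Ahmadi"
advisors
#> [1] "Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi"
```

Note that perl = TRUE is needed for the lookahead, since R's default regex engine does not support (?!...).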

Data:

DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))

Upvotes: 5

Rui Barradas

Reputation: 76402

Here is a base R solution.

DF <- data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))

stopwords <- c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")
stoppattern <- paste(stopwords, collapse = "|")

DF1 <- strsplit(DF$Names, "Name :")
DF1 <- lapply(DF1, \(x) trimws(x[sapply(x, nchar) > 0L]))

DF2 <- lapply(DF1, \(x) {
  gsub(stoppattern, "", x)
})

DF3 <- lapply(DF2, \(x) {
  # x was already cleaned in DF2, so only the split on commas is needed here
  y <- strsplit(x, ",")
  y <- lapply(y, trimws)
  lapply(y, \(.y) {
    .y <- trimws(.y)
    .y[sapply(.y, nchar) > 0L]
  })
})

DF4 <- lapply(DF3, \(x) {
  Supervisor <- x[[1]][1:2]
  Supervisor <- paste(trimws(Supervisor), collapse = " ")
  Advisors <- unlist(x[-1])
  Advisors <- paste(trimws(Advisors), collapse = ", ")
  data.frame(Supervisor, Advisors)
})

Final <- do.call(rbind, DF4)
Final
#>   Supervisor                                                 Advisors
#> 1 Ali Ahmadi Aram, Rezaeei, Omid, Saeedi, Nima, Shaki, Sohrab, Karimi
#> 2 Ali Ahmadi Aram, Rezaeei, Omid, Saeedi, Nima, Shaki, Sohrab, Karimi
#> 3 Ali Ahmadi Aram, Rezaeei, Omid, Saeedi, Nima, Shaki, Sohrab, Karimi

Created on 2022-06-05 by the reprex package (v2.0.1)
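The output above lists given and family names as separate comma-separated items, while the question asks for full names such as "Aram Rezaeei". One possible adjustment, sketched under the assumption that each element of DF3 is a list with one character vector per person (as the code above builds it): paste the first two fields of each person before collapsing. The names people and full_name are hypothetical and only used for this illustration.

```r
# Assumed structure of a single element of DF3: one character vector per
# person, with the supervisor first (shortened to three people here)
people <- list(
  c("Ali", "Ahmadi", "First supervisor"),
  c("Aram", "Rezaeei"),
  c("Omid", "Saeedi")
)

# Hypothetical helper: join the given name and the family name
full_name <- function(p) paste(p[1:2], collapse = " ")

supervisor <- full_name(people[[1]])
# toString() collapses the advisors' full names with ", "
advisors <- toString(vapply(people[-1], full_name, character(1)))

data.frame(Supervisor = supervisor, Advisors = advisors)
```

Applied inside the DF4 step, the same idea would replace unlist(x[-1]) with the vapply call, so that each advisor contributes one "Given Family" item instead of two.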

Upvotes: 2
