Reputation: 402
I have a data frame containing the names of the supervisors and advisors of students' dissertations in a faculty, for example:
DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))
I want to separate supervisors and advisors into two distinct columns, like this:
DF1<-data.frame(Supervisor=c("Ali Ahmadi","Ali Ahmadi","Ali Ahmadi"),Advisors=c("Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi"))
DF1
Supervisor Advisors
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
I tried the following code:
DF1<-strsplit(DF$Names, "Name :")
stopwords = c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")
DF2 <- lapply(DF1,function(x) unlist(strsplit(x," ")) )
DF3 <- lapply(DF2,function(x) x[!x %in% stopwords] )
DF4<-lapply(DF3,function(x) paste(x, collapse = " "))
But the final result, shown below, is not what I expected and apparently needs further work to be converted to a data frame:
DF4
[[1]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"
[[2]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"
[[3]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"
Is there a simpler way to solve this? I found that regular expressions can help, but I don't know how to use them, at least for my example. Thanks in advance for any answer.
Upvotes: 1
Views: 96
Reputation: 5788
Messy Base R:
# Store a vector of names: ir_names => character vector
ir_names <- c("Name", "Family", "Type")
# Compute its length: ir_name_len => integer scalar
ir_name_len <- length(ir_names)
# Compute the desired result: res => data.frame
res <- do.call(
  rbind,
  lapply(
    strsplit(
      DF$Names,
      "Name\\s+\\:\\s+"
    ),
    function(x){
      # One field per row: first name, "Family : ...", "Type : ..."
      y <- data.frame(tmp = unlist(strsplit(x, " , ")))
      # Reshape every ir_name_len consecutive fields into one row per person
      ir1 <- setNames(
        data.frame(
          do.call(
            rbind,
            lapply(
              split(
                y,
                ceiling(seq_len(nrow(y))/ir_name_len)
              ),
              t
            )
          ),
          row.names = NULL,
          stringsAsFactors = FALSE
        ),
        ir_names
      )
      # Strip the labels and join first and family name
      ir2 <- transform(
        ir1,
        Name = trimws(paste(Name, gsub("Family\\s+\\:\\s+", "", Family))),
        Type = trimws(gsub("Type\\s+\\:\\s+", "", Type))
      )[,c("Name", "Type")]
      # Logical indexing also works if a row has no supervisor
      ir3 <- data.frame(
        Supervisor = ir2$Name[grepl("supervisor", ir2$Type)],
        Advisor = toString(ir2$Name[!grepl("supervisor", ir2$Type)]),
        stringsAsFactors = FALSE,
        row.names = NULL
      )
    }
  )
)
# Print to console: data.frame => stdout(console)
res
Upvotes: 1
Reputation: 21400
Here's an attempt with extract:
library(dplyr)  # mutate()
library(tidyr)  # extract()
DF %>%
  # clean strings:
  mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names, perl = TRUE)) %>%
  # extract data into columns:
  extract(Names,
          into = c("Supervisor", "Advisor"),
          regex = "(\\w+\\s\\w+)\\s(.*)") %>%
  # insert commas into `Advisor`:
  mutate(Advisor = gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", Advisor, perl = TRUE))
Supervisor Advisor
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
Explanation (as requested by OP):
The regular expression in extract's regex argument is designed to do two tasks: (i) describe the whole cleaned string, and (ii) capture the parts that should go into the new columns.
Task (i) is achieved in that (\\w+\\s\\w+) captures the two words that make up the Supervisor name, while \\s describes (but does not capture) the following whitespace and (.*) describes/matches anything that follows that whitespace, i.e., in this case the four Advisor names.
Task (ii) is achieved by wrapping the Supervisor name and the Advisor names in capturing groups given in parentheses; these parentheses are the 'syntax' by which the function extract 'realizes' that their content should go into the new columns.
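To try the two capturing groups without tidyr, here is a minimal base R sketch using regexec and regmatches on a single hand-written string. The variable names cleaned, m, supervisor and advisor are made up for illustration, and the input assumes the labels have already been stripped as in the cleaning step above.

```r
# Illustrative cleaned string: supervisor first, then the advisors
cleaned <- "Ali Ahmadi Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# regexec() reports the full match followed by one entry per capturing group
m <- regmatches(cleaned, regexec("(\\w+\\s\\w+)\\s(.*)", cleaned, perl = TRUE))[[1]]

supervisor <- m[2]  # content of the first capturing group
advisor    <- m[3]  # content of the second capturing group

print(supervisor)
print(advisor)
```

The whitespace between the two groups is matched but not captured, so it ends up in neither value.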
Finally, the commas are inserted between the Advisor names, again using a capturing group, which can be recalled in gsub's replacement argument using a backreference (\\1). The (?!$) expression is a negative lookahead asserting that the comma is inserted only if what follows the word boundary anchor \\b is not (hence the ! in the lookahead) the end of the string (expressed by $). Hope this helps!
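The comma-insertion step can also be checked in isolation. The following is a small sketch on a hand-written input; the names advisors and with_commas are illustrative and not part of the answer's pipeline.

```r
# Illustrative input: advisor names separated by single spaces
advisors <- "Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# Capture each "first family" pair and append a comma,
# unless the pair is immediately followed by the end of the string
with_commas <- gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", advisors, perl = TRUE)

print(with_commas)
```

The last pair gets no comma because the lookahead fails there, and perl = TRUE is needed since lookaheads are not supported by R's default regex engine.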
Data:
DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))
Upvotes: 5
Reputation: 76402
Here is a base R solution.
DF <- data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))
stopwords <- c(":", "Type", "Family", "Name", "1", "2", "3", "Advisor")
stoppattern <- paste(stopwords, collapse = "|")
# Split each row into one string per person and drop the empty leading piece
DF1 <- strsplit(DF$Names, "Name :")
DF1 <- lapply(DF1, \(x) trimws(x[sapply(x, nchar) > 0L]))
# Remove the labels, leaving the names and the supervisor type
DF2 <- lapply(DF1, \(x) {
  gsub(stoppattern, "", x)
})
# Split each person on commas and drop empty fields
DF3 <- lapply(DF2, \(x) {
  y <- strsplit(x, ",")
  y <- lapply(y, trimws)
  lapply(y, \(.y) {
    .y[sapply(.y, nchar) > 0L]
  })
})
DF4 <- lapply(DF3, \(x) {
  # The first person is the supervisor; keep first and family name only
  Supervisor <- paste(x[[1]][1:2], collapse = " ")
  # Join each advisor's first and family name, then separate advisors by commas
  Advisors <- sapply(x[-1], paste, collapse = " ")
  Advisors <- paste(Advisors, collapse = ", ")
  data.frame(Supervisor, Advisors)
})
Final <- do.call(rbind, DF4)
Final
#> Supervisor Advisors
#> 1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
#> 2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
#> 3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
Created on 2022-06-05 by the reprex package (v2.0.1)
Upvotes: 2