R Cleaning and reordering names/serial numbers in data frame

Question

Let's say I have a data frame as follows in R:

 Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
 Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
 Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
 Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
 Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
 Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")

What I want to do is the following:

Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "... ".
Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = " ")).
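
For a single pair of strings, the four steps above can be sketched as follows. This is only an illustration on the first row's values, assuming the separator is a newline ("\n") and that both strings split into the same number of pieces; the variable names (serial, name, keep, perm) are made up for the example.

```r
# Values from the first row; "\n" is assumed to be the separator
serial_str <- "983\n837\n424\n "
name_str   <- "Michael\nLewis\nPaul\n "

# 1. Split each string into a character vector
serial <- strsplit(serial_str, split = "\n", fixed = TRUE)[[1]]
name   <- strsplit(name_str,   split = "\n", fixed = TRUE)[[1]]

# 2. Drop positions whose name is blank or the literal string "NA"
#    (assumes serial and name have the same length)
keep   <- !(trimws(name) %in% c("", "NA"))
serial <- serial[keep]
name   <- name[keep]

# 3. Sort names alphabetically and apply the same permutation to the serials
perm   <- order(name)
name   <- name[perm]
serial <- serial[perm]

# 4. Collapse back into single strings
name_out   <- paste(name,   collapse = "\n")
serial_out <- paste(serial, collapse = "\n")
```

Here the filter looks only at the names; if a serial number can be blank while its name is not, the keep condition would need to check both vectors.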

My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt, I made a list with LIST <- strsplit(Data$Name, split = " "), but from there I need a for loop to find the permutation of each set of names, which doesn't seem like it will scale to my actual data. Additionally, once I have LIST, I'm not sure how to remove the "NA" strings or blank entries. Any help is appreciated!

eipi10 · Accepted Answer

Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.

 seinfeld = lapply(1:nrow(Data), function(i) {

   # Turn strings into data frame with one name per row
   dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
              Year=Data[i,"Year"],
              Name=unlist(strsplit(Data[i,"Name"], split="\n")))

   # Get rid of empty strings and NA values
   dat = dat[!(dat$Name %in% c(""," ","NA")), ]

   # Order alphabetically
   dat = dat[order(dat$Name), ]
 })

UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:

seinfeld = lapply(1:nrow(Data), function(i) {

  # Turn strings into data frame with one name per row
  dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
                   Name=unlist(strsplit(Data[i,"Name"], split="\n")))

  # Get rid of empty strings and NA values
  dat = dat[!(dat$Name %in% c(""," ","NA")), ]

  # Order alphabetically
  dat = dat[order(dat$Name), ]

  # Collapse back into a single row with the new sort order
  dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
                   Year=Data[i, "Year"],
                   Name=paste(dat[, "Name"], collapse="\n"))

})

do.call(rbind, seinfeld)

           SerialNum Year                          Name
1      837\n983\n424 2015          Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010     George\nJohn\nPaul\nRingo
4      837\n983\n424 2015          Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
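
If only the two string columns need to change, the same logic could also be applied without building an intermediate data frame per row. The following is a minimal sketch of that variant, not part of the answer above: reorder_pair is a hypothetical helper name, the two-row Data is a cut-down copy of the question's data (rows 1 and 3), and it assumes every row has as many serial numbers as names.

```r
# Cut-down copy of the question's data (rows 1 and 3), for illustration only
Data <- data.frame(
  SerialNum = c("983\n837\n424\n ", "987\n654\n321\n975\n "),
  Year      = c(2015, 2010),
  Name      = c("Michael\nLewis\nPaul\n ", "John\nPaul\nGeorge\nRingo\nNA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: clean and reorder one serial/name pair
reorder_pair <- function(serial_str, name_str) {
  serial <- strsplit(serial_str, "\n", fixed = TRUE)[[1]]
  name   <- strsplit(name_str,   "\n", fixed = TRUE)[[1]]
  keep   <- !(trimws(name) %in% c("", "NA"))  # drop blanks and literal "NA"
  perm   <- order(name[keep])                 # alphabetical permutation
  c(SerialNum = paste(serial[keep][perm], collapse = "\n"),
    Name      = paste(name[keep][perm],   collapse = "\n"))
}

# Apply the helper to every row; the result is a 2 x nrow character matrix
# whose row names come from the names of the helper's return value
res <- mapply(reorder_pair, Data$SerialNum, Data$Name, USE.NAMES = FALSE)
Data$SerialNum <- res["SerialNum", ]
Data$Name      <- res["Name", ]
```

This still loops over rows internally, as lapply does, but it leaves Year and any other columns untouched and avoids the final do.call(rbind, ...) step.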
