Mnifldz
Mnifldz

Reputation: 155

R Cleaning and reordering names/serial numbers in data frame

Let's say I have a data frame as follows in R:

 Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
 Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
 Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
 Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
 Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
 Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")

What I want to do is the following:

  1. Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
  2. Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
  3. Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
  4. Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).

My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!

Upvotes: 0

Views: 1314

Answers (2)

eipi10
eipi10

Reputation: 93811

Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.

 seinfeld = lapply(1:nrow(Data), function(i) {

   # Turn strings into data frame with one name per row
   dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")), 
              Year=Data[i,"Year"],
              Name=unlist(strsplit(Data[i,"Name"], split="\n")))

   # Get rid of empty strings and NA values
   dat = dat[!(dat$Name %in% c(""," ","NA")), ]

   # Order alphabetically
   dat = dat[order(dat$Name), ]
 })

UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:

seinfeld = lapply(1:nrow(Data), function(i) {

  # Turn strings into data frame with one name per row
  dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")), 
                   Name=unlist(strsplit(Data[i,"Name"], split="\n")))

  # Get rid of empty strings and NA values
  dat = dat[!(dat$Name %in% c(""," ","NA")), ]

  # Order alphabetically
  dat = dat[order(dat$Name), ]

  # Collapse back into a single row with the new sort order
  dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
                   Year=Data[i, "Year"],
                   Name=paste(dat[, "Name"], collapse="\n"))

})

do.call(rbind, seinfeld)

           SerialNum Year                          Name
1      837\n983\n424 2015          Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010     George\nJohn\nPaul\nRingo
4      837\n983\n424 2015          Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer

Upvotes: 1

jazzurro
jazzurro

Reputation: 23574

eipi10 offered a great answer. In addition to that, I'd like to leave what I tried mainly with data.table. First, I split two columns (i.e., SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by the index. In the first lapply(), I used Stacked() from the splitstackshape package. I stacked SerialNum and Name; separated SeriaNum and Name become two columns, as you see in a part of temp2. In the second lapply(), I used merge from the data.table package. Then, I removed rows with NAs (lapply(na.omit)), combined all data tables (rbindlist), and changed order of rows by rowname, which is row number of the original data) and Name (setorder(rowname, Name))

library(data.table)
library(splitstackshape)
library(dplyr)

cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
       type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp

#a part of temp
#$`1`
#Source: local data frame [1 x 12]
#
#rowname  Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5  Name_1 Name_2
#(chr) (dbl)       (chr)       (chr)       (chr)       (chr)       (chr)   (chr)  (chr)
#1       1  2015         983         837         424          NA          NA Michael  Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)


lapply(temp, function(x){

    Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")

}) -> temp2

# A part of temp2
#$`1`
#$`1`$SerialNum
#   rowname Year .time_1 SerialNum
#1:       1 2015       1       983
#2:       1 2015       2       837
#3:       1 2015       3       424
#4:       1 2015       4        NA
#5:       1 2015       5        NA
#
#$`1`$Name
#   rowname Year .time_1    Name
#1:       1 2015       1 Michael
#2:       1 2015       2   Lewis
#3:       1 2015       3    Paul
#4:       1 2015       4      NA
#5:       1 2015       5      NA

lapply(1:nrow(mydf), function(x){

    merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))

}) %>%

lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out

print(out)

 #    rowname Year .time_1 SerialNum    Name
 # 1:       1 2015       2       837   Lewis
 # 2:       1 2015       1       983 Michael
 # 3:       1 2015       3       424    Paul
 # 4:       2 2014       1       123  Elaine
 # 5:       2 2014       3       789  George
 # 6:       2 2014       2       456   Jerry
 # 7:       2 2014       4       136  Kramer
 # 8:       3 2010       3       321  George
 # 9:       3 2010       1       987    John
 #10:       3 2010       2       654    Paul
 #11:       3 2010       4       975   Ringo
 #12:       4 2015       3       837   Lewis
 #13:       4 2015       2       983 Michael
 #14:       4 2015       1       424    Paul
 #15:       5 2014       3       123  Elaine
 #16:       5 2014       2       789  George
 #17:       5 2014       1       456   Jerry
 #18:       5 2014       4       136  Kramer

DATA

mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136", 
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ", 
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA", 
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum", 
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")

Upvotes: 1

Related Questions