Neuril
Neuril

Reputation: 183

R list to wide (sparse) data frame

My first time here so I hope I don't break anything... I have a list of lists:

Browse[2]> head(str(mylist))
List of 33
 $ : chr [1:33] "0001" "space" "28" "night_club" ...
 $ : chr [1:33] "0002" "concert" "28" "night_club" ...
 $ : chr [1:31] "0003" "night_club" "24" "martial_arts" ...
 $ : chr [1:31] "0004" "stage" "24" "basketball" ...
 $ : chr [1:43] "0005" "night_club" "16" "concert" ...
 $ : chr [1:43] "0006" "night_club" "16" "concert" ...
 $ : chr [1:39] "0007" "night_club" "22" "concert" ...
 $ : chr [1:39] "0008" "night_club" "22" "concert" ...
 $ : chr [1:31] "0009" "night_club" "46" "martial_arts" ...
 $ : chr [1:31] "0010" "night_club" "46" "martial_arts" ...
 $ : chr [1:41] "0011" "night_club" "17" "martial_arts" ...
 $ : chr [1:41] "0012" "night_club" "17" "martial_arts" ...
 $ : chr [1:29] "0013" "concert" "23" "night_club" ...
 $ : chr [1:29] "0014" "concert" "23" "night_club" ...
 $ : chr [1:25] "0015" "night_club" "26" "concert" ...
 $ : chr [1:31] "0016" "night_club" "42" "concert" ...
 $ : chr [1:31] "0017" "night_club" "42" "concert" ...
 $ : chr [1:31] "0018" "night_club" "25" "wrestling" ...
 $ : chr [1:31] "0019" "night_club" "25" "wrestling" ...
 $ : chr [1:33] "0020" "night_club" "46" "wrestling" ...
 $ : chr [1:33] "0021" "night_club" "46" "wrestling" ...
 $ : chr [1:41] "0022" "concert" "21" "stage" ...
 $ : chr [1:41] "0023" "concert" "21" "stage" ...
 $ : chr [1:55] "0024" "basketball" "8" "concert" ...
 $ : chr [1:55] "0025" "basketball" "8" "concert" ...
 $ : chr [1:37] "0026" "bald_person" "26" "martial_arts" ...
 $ : chr [1:37] "0027" "bald_person" "26" "martial_arts" ...
 $ : chr [1:37] "0028" "night_club" "32" "business_meeting" ...
 $ : chr [1:37] "0029" "night_club" "32" "business_meeting" ...
 $ : chr [1:15] "0030" "night_club" "59" "stage" ...
 $ : chr [1:37] "0031" "stage" "12" "night_club" ...
 $ : chr [1:37] "0032" "stage" "12" "night_club" ...
 $ : chr [1:33] "0033" "night_club" "23" "portrait" ...

I want to turn this list into a wide format data frame where the first column would be each of every inner list first element (i.e. "0001", "0002" etc.) and there would be all possible columns with categories exist in the file: "space", "night_club", "concert", "marital_arts", "wrestling" etc. meaning that I would a very wide data frame that each row will begin with some id (0001,0002,0003 ...) and the columns names would be again all categories in the file: "space", "night_club", "concert", "marital_arts", "wrestling" etc. and for each row, where the category exists for that id, it would populate the value next to the category from the list ("space" -> 28 from the first line for example).

I was trying to construct a normalized data frame with loops and then convert it to a wide format, but as data scales it would be a bad idea:

for (file in files){# iterate over files in folder

    mylist <- strsplit(readLines(file), ":")
    #close(mylist)
    for (elem in mylist){
      dataframe <- data.frame(frameid = numeric(), category = character(), nrow = length(unlist(elem)))
      frameid <- rep.int(elem[[1]], length(elem)-1) 
      categories <- elem[-1:-1]
      dataframe$frameid <- frameid
      dataframe$category <- categories
    }
  }

Reproducible input output example: dput of input:

 list(c("0001", "space", "28", "night_club", "25"), c("0002", 
"concert", "28", "night_club", "26"), c("0003", "night_club", 
"24", "martial_arts", "27"), c("0004", "stage", "24", "basketball", 
"30"))

output:

Dataframe
frameid, cat_space, cat_night_club, cat_concert, cat_martial_arts, cat_stage, cat_basketball
0001, 28, 25, 0, 0, 0, 0
0002, 0, 26, 28, 0, 0, 0
0003, 0, 24, 0, 27, 0, 0
0004, 0, 0, 0, 0, 24, 30

Upvotes: 2

Views: 422

Answers (1)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193627

Here's a possibility. I've created the answer as a function, and commented what is happening at each stage. The basic idea is to:

  1. Create a column of just the first items from each list element.
  2. Create a two-column matrix of the rest of the items. This assumes that the data are nicely paired.
  3. Create a data.frame of these two elements put together.
  4. Use xtabs to convert the output to a wide format. Note that if there are duplicated combinations of "ID" and "var", the values would be added together because of the use of xtabs.

Here's the function:

myFun <- function(inList) {
  ## Extract the first value in each list element
  ID <- vapply(inList, `[`, character(1L), 1)
  ## Convert the remaining elements into a two column matrix, first
  ##   column as variable, second column as value. Bind all list
  ##   elements together to a single 2-column mantrix.
  varval <- do.call(rbind, lapply(inList, function(x) {
    matrix(x[-1], ncol = 2, byrow = TRUE, dimnames = list(NULL, c("var", "val")))
  }))
  ## Create a data.frame where ID is repeated to the same number of rows
  ##   as the matrices found in varval.
  temp <- data.frame(ID = rep(ID, (lengths(inList)-1)/2), varval)
  ## Convert the val columns to numeric
  temp$val <- as.numeric(as.character(temp$val))
  ## Use xtabs to go from a "long" form to a "wide" form
  xtabs(val ~ ID + var, temp)
}

Here it is applied to your sample data (assuming your data is called "L"):

myFun(L)
#       var
# ID     basketball concert martial_arts night_club space stage
#   0001          0       0            0         25    28     0
#   0002          0      28            0         26     0     0
#   0003          0       0           27         24     0     0
#   0004         30       0            0          0     0    24

Upvotes: 1

Related Questions