zoowalk

Reputation: 2134

extracting unique elements from nested list in dataframe

I have a data.frame with a variable that contains the names of numerous participants. In each row, the names are stored as one long string, separated by commas. Some of the names are repeated, and I want to keep each name only once.

The data are given below.

I converted the long string of names into a list:

b$s <- strsplit(b$participants, ",")

I then removed spaces on both sides of names to standardize them.

library(stringr)
b.l <- unlist(b$s)
b.l <- str_trim(b.l, side="both")

From this list I took the unique values:

b.l <- unique(unlist(b.l))

The result is the set of all unique names:

"Takfir wa'l Hijra" "AIS" "GIA"  "AQIM" "MUJAO" "FLEC-R" "FLEC-FAC"  

However, this list pools the unique names across ALL rows. I would like to perform these steps separately for each ID (session number); the IDs themselves can also be repeated.

I tried to perform the operation above with ddply but to no avail. Any recommendation? Unfortunately, I am not very familiar with the handling of lists.

Eventually, the dataframe should look like this:

id    unique.participants 
1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO 
1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO  
1-192 FLEC-R, FLEC-FAC 

Many thanks.

data.frame:

    b <- structure(list(id = structure(c(1L, 1L, 2L), .Label = c("1-191", 
    "1-192", "1-131"), class = "factor"), participants = c("Takfir wa'l Hijra,AIS,AIS, GIA,AIS, GIA,AIS, GIA,AIS, GIA,AIS, GIA,GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM, MUJAO,AQIM", 
    "Takfir wa'l Hijra,AIS,AIS, GIA,AIS, GIA,AIS, GIA,AIS, GIA,AIS, GIA,GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM, MUJAO,AQIM", 
    "FLEC-R,FLEC-FAC, FLEC-R,FLEC-FAC,FLEC-FAC, FLEC-R,FLEC-FAC,FLEC-FAC, FLEC-R,FLEC-FAC,FLEC-FAC,FLEC-FAC"
    ), s = list(c("Takfir wa'l Hijra", "AIS", "AIS", " GIA", "AIS", 
    " GIA", "AIS", " GIA", "AIS", " GIA", "AIS", " GIA", "GIA", "AQIM", 
    " GIA", "AQIM", " GIA", "AQIM", " GIA", "AQIM", " GIA", "AQIM", 
    " GIA", "AQIM", "AQIM", "AQIM", "AQIM", "AQIM", "AQIM", "AQIM", 
    "AQIM", "AQIM", " MUJAO", "AQIM"), c("Takfir wa'l Hijra", "AIS", 
    "AIS", " GIA", "AIS", " GIA", "AIS", " GIA", "AIS", " GIA", "AIS", 
    " GIA", "GIA", "AQIM", " GIA", "AQIM", " GIA", "AQIM", " GIA", 
    "AQIM", " GIA", "AQIM", " GIA", "AQIM", "AQIM", "AQIM", "AQIM", 
    "AQIM", "AQIM", "AQIM", "AQIM", "AQIM", " MUJAO", "AQIM"), c("FLEC-R", 
    "FLEC-FAC", " FLEC-R", "FLEC-FAC", "FLEC-FAC", " FLEC-R", "FLEC-FAC", 
    "FLEC-FAC", " FLEC-R", "FLEC-FAC", "FLEC-FAC", "FLEC-FAC"))), .Names = c("id", 
    "participants", "s"), row.names = c(1L, 2L, 24L), class = "data.frame")

Upvotes: 0

Views: 936

Answers (3)

asltjoey

Reputation: 1

This should be a simpler way to get what you want, using data.table:

library(data.table)
# trimws() strips only leading/trailing spaces, so names with internal spaces stay intact;
# lapply() returns one character vector per row, stored as a list column
b <- data.table(b)[, unique_s := lapply(s, function(x) unique(trimws(x)))]

#-- Output --#
b$unique_s
[[1]]
[1] "Takfir wa'l Hijra" "AIS"               "GIA"               "AQIM"              "MUJAO"

[[2]]
[1] "Takfir wa'l Hijra" "AIS"               "GIA"               "AQIM"              "MUJAO"

[[3]]
[1] "FLEC-R"   "FLEC-FAC"
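
Note that a pattern of " " in gsub matches every space, including those inside a name, whereas trimws touches only the two ends of each string. A minimal base R sketch of the difference, using a short hypothetical vector x rather than the full data:

```r
# Hypothetical sample: one name with internal spaces, two with a leading space
x <- c("Takfir wa'l Hijra", " GIA", "GIA", " MUJAO")

# gsub(" ", "", x) deletes every space, so the first name gets squashed together
squashed <- gsub(" ", "", x)

# trimws() removes leading and trailing whitespace only
trimmed <- trimws(x)

# After trimming, " GIA" and "GIA" are recognised as the same name
unique_names <- unique(trimmed)
print(squashed)
print(unique_names)
```

trimws is part of base R (3.2.0 and later), so this needs no extra package; stringr::str_trim from the question behaves the same way for this purpose.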

Upvotes: 0

Rich Scriven

Reputation: 99331

within would be good for this. It allows for reassignment of the variables within the expression. Also, you could adjust your regular expression in strsplit so that you can remove those spaces and the commas in one go.

> within(b[-3],{
      unique.participants <- sapply(strsplit(participants, "(,)|(, )"), unique)
      rm(participants)
  })
#       id                      unique.participants
# 1  1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO
# 2  1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO
# 24 1-192                         FLEC-R, FLEC-FAC

Since I'm seeing "I would like to perform these steps only for each ID (session number), which can be also repetitive." in your question, I'm sticking with the duplicate row.
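
As a rough base R sketch of the same row-wise idea, the split pattern can also be written as ",\\s*" (a comma followed by any amount of whitespace). The data frame below is a shortened, hypothetical stand-in for the one in the question, and vapply is swapped in for sapply here so that the result is always a plain character column, even if every row happened to yield the same number of names:

```r
# Shortened, hypothetical version of the question's data (one row per session entry)
b <- data.frame(
  id = c("1-191", "1-191", "1-192"),
  participants = c(
    "Takfir wa'l Hijra,AIS,AIS, GIA,AQIM, MUJAO,AQIM",
    "Takfir wa'l Hijra,AIS,AIS, GIA,AQIM, MUJAO,AQIM",
    "FLEC-R,FLEC-FAC, FLEC-R,FLEC-FAC"
  ),
  stringsAsFactors = FALSE
)

# Split on a comma plus any whitespace after it, so the pieces come out trimmed
parts <- strsplit(b$participants, ",\\s*")

# Collapse the unique names of each row back into a single string
b$unique.participants <- vapply(
  parts,
  function(x) paste(unique(x), collapse = ", "),
  character(1)
)

print(b[c("id", "unique.participants")])
```

This keeps one output row per input row, including the duplicated id.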

Upvotes: 2

agstudy

Reputation: 121568

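Using ddply, you can do this. Splitting on ",\\s*" (a comma plus any following whitespace) keeps " GIA" and "GIA" from being counted as different names:

library(plyr)
ddply(b, ~id, summarise,
      nn = paste(unique(unlist(strsplit(participants, ",\\s*"))), collapse = ", "))

     id                                       nn
1 1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO
2 1-192                         FLEC-R, FLEC-FAC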
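
If the original rows should be kept (one row per entry, with the names pooled over all rows sharing an id), a base R sketch with ave might look like the following. The data frame is hypothetical and deliberately gives the two "1-191" rows different strings, so that the pooling is visible; it is an illustration under those assumptions, not a drop-in replacement for the ddply call:

```r
# Hypothetical data: the two "1-191" rows carry different participant strings
b <- data.frame(
  id = c("1-191", "1-191", "1-192"),
  participants = c(
    "Takfir wa'l Hijra,AIS, GIA",
    "AIS,AQIM, MUJAO",
    "FLEC-R,FLEC-FAC, FLEC-R"
  ),
  stringsAsFactors = FALSE
)

# Pool all strings of one id, split on commas, trim, and keep each name once
pool_unique <- function(x) {
  paste(unique(trimws(unlist(strsplit(x, ",")))), collapse = ", ")
}

# ave() applies the function per id and writes the result back to every row of that id
b$unique.participants <- ave(b$participants, b$id, FUN = pool_unique)

print(b[c("id", "unique.participants")])
```

Both "1-191" rows end up with the same pooled list, which matches the layout requested in the question.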

Upvotes: 3
