Reputation: 2134
I have a data.frame with a variable that contains the names of numerous participants. For each row, the names are stored as one long string, separated by commas. Some of the names are repeated. I want to keep each name only once.
The data are given at the end of the question.
I converted the long string of names into a list:
b$s <- strsplit(b$participants, ",")
I then removed spaces on both sides of names to standardize them.
library(stringr)
b.l <- unlist(b$s)
b.l <- str_trim(b.l, side="both")
From this list I took the unique values
b.l <- unique(unlist(b.l))
The result is the set of all unique names:
"Takfir wa'l Hijra" "AIS" "GIA" "AQIM" "MUJAO" "FLEC-R" "FLEC-FAC"
However, this list contains ALL unique names across the whole data.frame. I would like to perform these steps separately for each ID (session number); the IDs can also be repeated.
I tried to perform the operation above with ddply, but to no avail. Any recommendations? Unfortunately, I am not very familiar with handling lists.
In the end, the data.frame should look like this:
id unique.participants
1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO
1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO
1-192 FLEC-R, FLEC-FAC
Many thanks.
data.frame:
b <- structure(list(id = structure(c(1L, 1L, 2L), .Label = c("1-191",
"1-192", "1-131"), class = "factor"), participants = c("Takfir wa'l Hijra,AIS,AIS, GIA,AIS, GIA,AIS, GIA,AIS, GIA,AIS, GIA,GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM, MUJAO,AQIM",
"Takfir wa'l Hijra,AIS,AIS, GIA,AIS, GIA,AIS, GIA,AIS, GIA,AIS, GIA,GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM, GIA,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM,AQIM, MUJAO,AQIM",
"FLEC-R,FLEC-FAC, FLEC-R,FLEC-FAC,FLEC-FAC, FLEC-R,FLEC-FAC,FLEC-FAC, FLEC-R,FLEC-FAC,FLEC-FAC,FLEC-FAC"
), s = list(c("Takfir wa'l Hijra", "AIS", "AIS", " GIA", "AIS",
" GIA", "AIS", " GIA", "AIS", " GIA", "AIS", " GIA", "GIA", "AQIM",
" GIA", "AQIM", " GIA", "AQIM", " GIA", "AQIM", " GIA", "AQIM",
" GIA", "AQIM", "AQIM", "AQIM", "AQIM", "AQIM", "AQIM", "AQIM",
"AQIM", "AQIM", " MUJAO", "AQIM"), c("Takfir wa'l Hijra", "AIS",
"AIS", " GIA", "AIS", " GIA", "AIS", " GIA", "AIS", " GIA", "AIS",
" GIA", "GIA", "AQIM", " GIA", "AQIM", " GIA", "AQIM", " GIA",
"AQIM", " GIA", "AQIM", " GIA", "AQIM", "AQIM", "AQIM", "AQIM",
"AQIM", "AQIM", "AQIM", "AQIM", "AQIM", " MUJAO", "AQIM"), c("FLEC-R",
"FLEC-FAC", " FLEC-R", "FLEC-FAC", "FLEC-FAC", " FLEC-R", "FLEC-FAC",
"FLEC-FAC", " FLEC-R", "FLEC-FAC", "FLEC-FAC", "FLEC-FAC"))), .Names = c("id",
"participants", "s"), row.names = c(1L, 2L, 24L), class = "data.frame")
Upvotes: 0
Reputation: 1
This should be a simpler way to get what you wanted, using data.table.
library(data.table)
b = data.table(b)[, unique_s := mapply(s, FUN = function(x) { unique(gsub(" ","",unlist(x))) } )]
#-- Output --#
b$unique_s
[[1]]
[1] "Takfirwa'lHijra" "AIS" "GIA" "AQIM" "MUJAO"
[[2]]
[1] "Takfirwa'lHijra" "AIS" "GIA" "AQIM" "MUJAO"
[[3]]
[1] "FLEC-R" "FLEC-FAC"
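Note that gsub(" ", "", x) removes every space, including the ones inside a name, which is why "Takfir wa'l Hijra" shows up as "Takfirwa'lHijra" in the output above. A minimal sketch that strips only leading and trailing whitespace, assuming base R's trimws() (available from R 3.2.0) and a shortened toy list standing in for b$s:

```r
# Toy list column standing in for b$s (values shortened from the question's data)
s <- list(
  c("Takfir wa'l Hijra", "AIS", " GIA", "GIA", " MUJAO"),
  c("FLEC-R", "FLEC-FAC", " FLEC-R")
)

# Trim whitespace at both ends only, then drop duplicates within each element
unique_s <- lapply(s, function(x) unique(trimws(x)))
```

The same function could be dropped into the mapply() call above in place of the gsub() version.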
Upvotes: 0
Reputation: 99331
within() would be good for this. It allows variables to be reassigned within the expression. Also, you could adjust the regular expression in strsplit() so that the commas and the spaces after them are removed in one go.
> within(b[-3],{
unique.participants <- sapply(strsplit(participants, "(,)|(, )"), unique)
rm(participants)
})
# id unique.participants
# 1 1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO
# 2 1-191 Takfir wa'l Hijra, AIS, GIA, AQIM, MUJAO
# 24 1-192 FLEC-R, FLEC-FAC
Since your question says "I would like to perform these steps only for each ID (session number), which can be also repetitive", I'm keeping the duplicate row.
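The within() call above yields a list column. If a plain character column like the one in the question's desired output is preferred, the unique names can be pasted back together. This is only a sketch on shortened toy data; the split pattern ",\\s*" (a comma plus any following whitespace) is a substitution for the alternation used above, not the pattern from this answer:

```r
# Toy data shaped like the question's data.frame (strings shortened)
b <- data.frame(
  id = c("1-191", "1-191", "1-192"),
  participants = c("Takfir wa'l Hijra,AIS, GIA,GIA,AQIM, MUJAO",
                   "Takfir wa'l Hijra,AIS, GIA,GIA,AQIM, MUJAO",
                   "FLEC-R,FLEC-FAC, FLEC-R"),
  stringsAsFactors = FALSE
)

# Split each row's string, keep the unique names, and collapse them to one string
b$unique.participants <- vapply(
  strsplit(b$participants, ",\\s*"),
  function(x) paste(unique(x), collapse = ", "),
  character(1)
)
```

vapply() with character(1) guarantees one string per row, so the new column is an ordinary character vector rather than a list.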
Upvotes: 2
Reputation: 121568
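Using ddply you can do this. Splitting on a comma plus any following spaces ensures that " GIA" and "GIA" are not treated as different names:
library(plyr)
ddply(b, ~id, summarise,
      nn = paste(unique(unlist(strsplit(participants, ",\\s*"))), collapse = ","))
     id                                   nn
1 1-191 Takfir wa'l Hijra,AIS,GIA,AQIM,MUJAO
2 1-192                      FLEC-R,FLEC-FAC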
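ddply with summarise returns one row per id. If the original rows should be kept, as in the desired output in the question, the per-id result can be computed once and mapped back onto every row. A base R sketch under assumptions: the toy data below are invented so that the two rows of id 1-191 differ, and the names per_id and unique.participants are illustrative only:

```r
# Toy data: two rows share an id but list different (overlapping) names
b <- data.frame(
  id = c("1-191", "1-191", "1-192"),
  participants = c("Takfir wa'l Hijra,AIS, GIA",
                   "AIS,AQIM, MUJAO, GIA",
                   "FLEC-R,FLEC-FAC, FLEC-R"),
  stringsAsFactors = FALSE
)

# One string of unique names per id, pooling all rows that share the id
per_id <- tapply(b$participants, b$id, function(p) {
  paste(unique(unlist(strsplit(p, ",\\s*"))), collapse = ", ")
})

# Look the per-id string up for every original row, keeping duplicate ids
b$unique.participants <- as.vector(per_id[as.character(b$id)])
```

Indexing per_id by the id values repeats the pooled string for each row, so the number of rows in b stays unchanged.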
Upvotes: 3