Reputation: 149
I have data that looks like this:
DocID Impact
CCRB-9-569 114;Adaptation - Strategic
CCRB-9-531 173;Nutrient trading
CCRB-9-886
CCRB-9-989
CCRB-9-530 71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
CCRB-9-671 106;Adaptation Responses;98;Climate Change
CCRB-9-570 114;Adaptation - Strategic
CCRB-9-990
CCRB-9-526 98;Climate Change
Ideally I want to end up with:
DocID Impact
CCRB-9-569 Adaptation - Strategic
CCRB-9-531 Nutrient trading
CCRB-9-886
CCRB-9-989
CCRB-9-530 Change in Temperature
CCRB-9-530 Extreme weather events
CCRB-9-530 Lower Rainfall
CCRB-9-671 Adaptation Responses
CCRB-9-671 Climate Change
CCRB-9-570 Adaptation - Strategic
CCRB-9-990
CCRB-9-526 Climate Change
I started out trying
test1=lapply(unlist(strsplit(test$Impact,"\\;")),as.character)
but then don't have the ability to link back to DocID and don't get any spaces for the rows with no input. I've played around with leaving out the unlist, trying to relist, using a cbind.fill function, merge etc, but I'm missing something. If the numbers in the Impact column (114, 173 etc) end up in the output file, that's fine, as long as they are assigned the right DocID number. Thanks for your help
Upvotes: 1
Views: 206
Reputation: 263451
I couldn't get @csgillespie 's function to do the strsplit
properly, so I made my own:
foo <- function(x){ ivec <-
unlist( # needed to convert the list from strsplit to a vector.
# The regex split pattern can be read as
#---- "find any sections possibly but not necessarily starting with a space or ";"
# --- "followed necessarily by one or more digits and a ";"
# strsplit will split and remove these segments.
strsplit( as.character(x), split= "\\s?;?[[:digit:]]+;" ))
#Need to remove length zero items except for the DocID's that don't have any
if( any(nchar(ivec))>0){ ivec[nchar(ivec) >0 ] }else{""}
} # end of function.
out <- ddply(dta, .(DocID), summarise, Impact=foo(Impact) )
out
#--------------
DocID Impact
1 CCRB-9-526 Climate Change
2 CCRB-9-530 Change in Temperature
3 CCRB-9-530 Extreme weather events
4 CCRB-9-530 Lower Rainfall
5 CCRB-9-531 Nutrient trading
6 CCRB-9-569 Adaptation - Strategic
7 CCRB-9-570 Adaptation - Strategic
8 CCRB-9-671 Adaptation Responses
9 CCRB-9-671 Climate Change
10 CCRB-9-886
11 CCRB-9-989
12 CCRB-9-990
Construction of the test case (need to use non-whitespace separator):
dta <- read.table(text="DocID | Impact
CCRB-9-569 | 114;Adaptation - Strategic
CCRB-9-531 | 173;Nutrient trading
CCRB-9-886 |
CCRB-9-989 |
CCRB-9-530 | 71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
CCRB-9-671 | 106;Adaptation Responses;98;Climate Change
CCRB-9-570 | 114;Adaptation - Strategic
CCRB-9-990 |
CCRB-9-526 | 98;Climate Change", header=TRUE, sep="|")
Upvotes: 2
Reputation: 60492
You can do this fairly easily using the plyr
package. First, create some dummy data and load the package:
dd = data.frame(DocID = c("CCRB-9-569", "CCRB-9-530", "CCRB-9-886"),
Impact=c("114;Adaptation - Strategic",
"71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall",
""), stringsAsFactors=FALSE)
library(plyr)
Next, we create a function that will work on Impact
column:
f = function(i) {
l = unlist(strsplit(as.character(i),";"))
##Need to determine if the string was empty
if(length(l)> 1) l = l[seq(2, length(l), by=2)]
return(l)
}
Then we use ddply
:
ddply(dd, "DocID", summarise, Impact = f(Impact))
Here we have dd
as the input, separate out by DocID and apply the functionf f
to out Impact chunk.
Note, my function f
assumes you want to split the string at ;
Function logic
The plyr
function "creates" smaller data frames conditional on their DocID
values. I then assume that for a particular DocID
value has the format:
Number;string;Number;string;Number;string
When we split based on ;
, we get the vector:
Number, string, Number, string, Number, string
So we just need to choose the even elements, i.e.
l[seq(2, length(l), 2)]
Upvotes: 0
Reputation: 115435
a similar, data.table
solution
# some dummy data
.data <- data.frame(id = letters[1:5], text = c('12;a-b;34','','a-c','a-c;12;12',''))
# make both columns character, not factor, and make it a data.table
.data <- as.data.table(lapply(.data, as.character))
# for each id, split and return (returning '' if nothing)
.data[, { value = unlist(strsplit(text,split = '\\;'))
if (length(value) == 0) text else value },
by = id]
Upvotes: 3