Elizabeth
Elizabeth

Reputation: 149

R separating strings and linking them to another column

I have data that looks like this:

DocID             Impact
CCRB-9-569  114;Adaptation - Strategic
CCRB-9-531  173;Nutrient trading
CCRB-9-886  
CCRB-9-989  
CCRB-9-530  71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
CCRB-9-671  106;Adaptation Responses;98;Climate Change
CCRB-9-570  114;Adaptation - Strategic
CCRB-9-990  
CCRB-9-526  98;Climate Change

Ideally I want to end up with:

DocID             Impact
CCRB-9-569  Adaptation - Strategic
CCRB-9-531  Nutrient trading
CCRB-9-886  
CCRB-9-989  
CCRB-9-530  Change in Temperature
CCRB-9-530  Extreme weather events
CCRB-9-530  Lower Rainfall
CCRB-9-671  Adaptation Responses
CCRB-9-671  Climate Change
CCRB-9-570  Adaptation - Strategic
CCRB-9-990  
CCRB-9-526  Climate Change

I started out trying

test1=lapply(unlist(strsplit(test$Impact,"\\;")),as.character)

but then don't have the ability to link back to DocID and don't get any spaces for the rows with no input. I've played around with leaving out the unlist, trying to relist, using a cbind.fill function, merge etc, but I'm missing something. If the numbers in the Impact column (114, 173 etc) end up in the output file, that's fine, as long as they are assigned the right DocID number. Thanks for your help

Upvotes: 1

Views: 206

Answers (3)

IRTFM
IRTFM

Reputation: 263451

I couldn't get @csgillespie 's function to do the strsplit properly, so I made my own:

 foo <- function(x){  ivec <-                   
  unlist(    # needed to convert the list from strsplit to a vector.
  # The regex split pattern can be read as 
     #---- "find any sections possibly but not necessarily starting with a space or ";"
     # --- "followed necessarily by one or more digits and a ";"
  # strsplit will split and remove these segments.

     strsplit( as.character(x), split= "\\s?;?[[:digit:]]+;" ))   

   #Need to remove length zero items except for the DocID's that don't have any   

     if( any(nchar(ivec))>0){ ivec[nchar(ivec) >0 ] }else{""}
    } # end of function.

 out <- ddply(dta, .(DocID), summarise, Impact=foo(Impact) )
 out
#--------------
         DocID                 Impact
1  CCRB-9-526          Climate Change
2  CCRB-9-530   Change in Temperature
3  CCRB-9-530  Extreme weather events
4  CCRB-9-530          Lower Rainfall
5  CCRB-9-531        Nutrient trading
6  CCRB-9-569  Adaptation - Strategic
7  CCRB-9-570  Adaptation - Strategic
8  CCRB-9-671    Adaptation Responses
9  CCRB-9-671          Climate Change
10 CCRB-9-886                        
11 CCRB-9-989                        
12 CCRB-9-990                        

Construction of the test case (need to use non-whitespace separator):

dta <- read.table(text="DocID     |        Impact
 CCRB-9-569 | 114;Adaptation - Strategic
 CCRB-9-531 | 173;Nutrient trading
 CCRB-9-886 | 
 CCRB-9-989 | 
 CCRB-9-530 | 71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
 CCRB-9-671 | 106;Adaptation Responses;98;Climate Change
 CCRB-9-570 | 114;Adaptation - Strategic
 CCRB-9-990 | 
 CCRB-9-526 | 98;Climate Change", header=TRUE, sep="|")

Upvotes: 2

csgillespie
csgillespie

Reputation: 60492

You can do this fairly easily using the plyr package. First, create some dummy data and load the package:

dd = data.frame(DocID = c("CCRB-9-569", "CCRB-9-530", "CCRB-9-886"),
                 Impact=c("114;Adaptation - Strategic", 
     "71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall",
                          ""), stringsAsFactors=FALSE)
library(plyr)

Next, we create a function that will work on Impact column:

f = function(i) { 
    l = unlist(strsplit(as.character(i),";"))
    ##Need to determine if the string was empty
    if(length(l)> 1) l = l[seq(2, length(l), by=2)]
    return(l)

}

Then we use ddply:

ddply(dd, "DocID", summarise, Impact = f(Impact))

Here we have dd as the input, separate out by DocID and apply the functionf f to out Impact chunk.


Note, my function f assumes you want to split the string at ;

Function logic

The plyr function "creates" smaller data frames conditional on their DocID values. I then assume that for a particular DocID value has the format:

 Number;string;Number;string;Number;string

When we split based on ;, we get the vector:

Number, string, Number, string, Number, string

So we just need to choose the even elements, i.e.

l[seq(2, length(l), 2)]

Upvotes: 0

mnel
mnel

Reputation: 115435

a similar, data.table solution

# some dummy data
.data <- data.frame(id = letters[1:5], text = c('12;a-b;34','','a-c','a-c;12;12',''))
# make both columns character, not factor, and make it a data.table
.data <- as.data.table(lapply(.data, as.character))
# for each id, split and return (returning '' if nothing)


.data[, { value = unlist(strsplit(text,split = '\\;')) 
          if (length(value) == 0) text else value },
        by = id]

Upvotes: 3

Related Questions