Reputation: 628
I have a field of semantic tags and tag categories, along with Source, Date, and ID variables. I want to split the semantic tag field into its individual tag categories and tags, then transpose the dataset into a long format. I have most of the code worked out, but I am still stuck on getting the ID, Date, and Source variables to repeat down the matrix I create from the tag categories and tags. An example of the data I start with (tab-delimited) is below:
ID Source Date Semantic Tags
1 thestate 2013-01-18 Person:elizabeth colbert-busch, Organization:congress
2 abcnews4 2013-04-03 PoliticalEvent:congressional race, Person:colbert busch, topicname:politics
3 Politics 2013-04-02 Person:mark sanford, Person:elizabeth colbert busch, Person:colbert busch, Organization:republican party
I want the data to look like a database format (also tab-delimited):
ID Source Date Tag Type Tag
1 thestate 2013-01-18 Person elizabeth colbert-busch
1 thestate 2013-01-18 Organization congress
2 abcnews 2013-04-03 Political event congressional race
2 abcnews 2013-04-04 Person colbert-busch
2 abcnews 2013-04-05 topicname politics
3 Politics 2013-04-02 person mark sanford
3 Politics 2013-04-03 person elizabeth colbert-busch
3 Politics 2013-04-04 organization republican party
I'm having no trouble separating the tag types and tags (thanks to @Tyler Rinker for help on that), but I am stuck on getting the ID, Source, and Date variables to repeat row by row down the tag type/tag matrix that I create. Can anyone help? My code is below:
et3 <- lapply(strsplit(as.character(et$Semantic.Tags), ","), function(x) gsub("^\\s+|\\s+$", "", x)) # split semantic tags/tag types on commas and trim surrounding whitespace
et3 <- lapply(et3, strsplit, ":(?!/)", perl=TRUE) # break on colon
The following lines of code, where I try to replicate the other three variables, are where I have problems:
Date <- rep(et$Date, seq_along(et3), sapply(et3, length))
ID <- rep(et$ID, seq_along(et3), sapply(et3, length)) # Note that if I don't use "et$ID", the IDs replicate without issue...
The same goes for the Source variable. The warning message I receive is: In rep(et$Date, seq_along(et3), sapply(et3, length)): first element used of 'length.out' argument.
Only the first value appears in the output. The same problem happens if I first bind the et3 lists into a matrix. Can anyone help with repeating the variables down a matrix or list? I have also tried a transpose command, but I don't know how to handle the tags that I turned into lists.
Thanks for any help.
Upvotes: 1
Views: 771
Reputation: 115382
# 1. create a matrix containing the expanded information for each row
#    (et3 is the nested list built in the question)
et3 <- lapply(et3, function(x) {
  xx <- do.call(rbind, x)
  colnames(xx) <- c('tag', 'value')
  xx
})
# 2. cycle through each row and recombine
#    (edt is the original data frame, called et in the question)
do.call(rbind, lapply(seq_len(nrow(edt)),
  function(x) cbind(edt[x, 1:3, drop = FALSE], et3[[x]])))
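The warning in the question most likely comes from positional argument matching: in rep(x, times, length.out, each), the second positional argument is taken as times and the third as length.out, so sapply(et3, length) ends up as length.out and only its first element is used. A minimal sketch of the rep-based route, naming times explicitly, could look like the following. The toy data frame copies the first two rows from the question; the names pieces, pairs, n_tags, idx, tag_mat and long are illustrative only and not from the original posts.

```r
# Toy data: first two rows from the question (column names assumed to match et).
et <- data.frame(
  ID = 1:2,
  Source = c("thestate", "abcnews4"),
  Date = c("2013-01-18", "2013-04-03"),
  Semantic.Tags = c(
    "Person:elizabeth colbert-busch, Organization:congress",
    "PoliticalEvent:congressional race, Person:colbert busch, topicname:politics"
  ),
  stringsAsFactors = FALSE
)

# Split on commas, trim whitespace, then split each piece on the colon.
pieces <- lapply(strsplit(et$Semantic.Tags, ","),
                 function(x) gsub("^\\s+|\\s+$", "", x))
pairs <- lapply(pieces, strsplit, ":(?!/)", perl = TRUE)

# Number of tags found in each original row.
n_tags <- sapply(pairs, length)

# Name the 'times' argument so row i is repeated n_tags[i] times.
idx <- rep(seq_len(nrow(et)), times = n_tags)

# Flatten the nested list one level and stack the (type, tag) pairs.
tag_mat <- do.call(rbind, unlist(pairs, recursive = FALSE))

long <- data.frame(et[idx, c("ID", "Source", "Date")],
                   Tag.Type = tag_mat[, 1],
                   Tag = tag_mat[, 2],
                   row.names = NULL,
                   stringsAsFactors = FALSE)
print(long)
```

Indexing the data frame with a repeated row index replicates ID, Source, and Date together, which avoids calling rep() once per column.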
data.table approach
# an alternative is to use data.table
library(data.table)
EDT <- data.table(edt)
# string processing
EDT[, et3 := lapply(strsplit(as.character(Semantic.Tags), ","), function(x) gsub("^\\s+|\\s+$", "", x))]
EDT[, et3 := lapply(et3, strsplit, ":(?!/)", perl=TRUE)]
# rapply and by to create data.table
EDT[, list(tag = rapply(et3, classes = 'character', function(x)x[1]),
value = rapply(et3, classes = 'character', function(x)x[2])),
by = list(ID, Source,Date)]
ID Source Date tag value
1: 1 thestate 2013-01-18 Person elizabeth colbert-busch
2: 1 thestate 2013-01-18 Organization congress
3: 2 abcnews4 2013-04-03 PoliticalEvent congressional race
4: 2 abcnews4 2013-04-03 Person colbert busch
5: 2 abcnews4 2013-04-03 topicname politics
6: 3 Politics 2013-04-02 Person mark sanford
7: 3 Politics 2013-04-02 Person elizabeth colbert busch
8: 3 Politics 2013-04-02 Person colbert busch
9: 3 Politics 2013-04-02 Organization republican party
Upvotes: 4