Reputation: 28427
I have a dataframe df that contains a couple of columns, but the only relevant ones are given below.
node | precedingWord
-------------------------
A-bom de
A-bom die
A-bom de
A-bom een
A-bom n
A-bom de
acroniem het
acroniem t
acroniem het
acroniem n
acroniem een
act de
act het
act die
act dat
act t
act n
I'd like to use these values to count the precedingWords per node, but with subcategories: one column titled neuter, another non-neuter, and a last one rest. neuter would contain all rows for which precedingWord is one of t, het, dat. non-neuter would contain de and die, and rest would contain everything that doesn't belong in neuter or non-neuter. (It would be nice if this could be dynamic, in other words that rest uses some sort of negation of the values used for neuter and non-neuter, or simply subtracts the neuter and non-neuter counts from the number of rows with that node.)
Example output (in a new dataframe, let's say freqDf) would look like this:
node | neuter | nonNeuter | rest
-----------------------------------------
A-bom 0 4 2
acroniem 3 0 2
act 3 2 1
To create freqDf$node, I can do this:
freqDf<- data.frame(node = unique(df$node), stringsAsFactors = FALSE)
But that's already all I got; I don't know how to continue. I figured I could do something like this, but unfortunately the ++ operator doesn't work as I had hoped.
freqDf$neuter[grep("dat|het|t", df$precedingWord, perl=TRUE)] <- ++
freqDf$nonNeuter[grep("de|die", df$precedingWord, perl=TRUE)] <- ++
e <- table(df$Node)
freqDf$rest <- as.numeric(e - freqDf$neuter - freqDf$nonNeuter)
Also, this won't work for each node individually. I need some sort of loop that automatically runs for each different value in freqDf$node.
Upvotes: 2
Views: 1126
Reputation: 7830
One way is to replace the values by their categories and then use the table function to generate the frequencies.
neuter <- c("t", "het", "dat")
non.neuter <- c("de", "die")
df$precedingWord[df$precedingWord %in% neuter] <- "neuter"
df$precedingWord[df$precedingWord %in% non.neuter] <- "non.neuter"
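# The two assignments above have already overwritten the matching words,
# so test against the new labels here, not against the original word lists.
df$precedingWord[!df$precedingWord %in% c("neuter", "non.neuter")] <- "rest"
# Select the two columns explicitly in case df has other columns.
table(df[c("node", "precedingWord")])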
precedingWord
node neuter non.neuter rest
A-bom 0 4 2
acroniem 3 0 2
act 3 2 1
But I'm sure there is a better solution, for example with the dplyr package.
EDIT: Maybe something like this (it doesn't overwrite your "precedingWord" column but adds a new "gender" one, so run it on the original df):
library(dplyr)
df %>%
mutate(gender = ifelse(!precedingWord %in% c(neuter, non.neuter), "rest",
ifelse(precedingWord %in% neuter, "neuter", "non.neuter"))) %>%
count(node, gender)
Source: local data frame [7 x 3]
Groups: node
node gender n
1 A-bom non.neuter 4
2 A-bom rest 2
3 acroniem neuter 3
4 acroniem rest 2
5 act neuter 3
6 act non.neuter 2
7 act rest 1
# And if you want the same output you put in your question, you can use table
df2 <- mutate(df, gender = ifelse(!precedingWord %in% c(neuter, non.neuter), "rest",
ifelse(precedingWord %in% neuter, "neuter", "non.neuter")))
table(df2$node, df2$gender)
neuter non.neuter rest
A-bom 0 4 2
acroniem 3 0 2
act 3 2 1
Edit: Convert the table to a data frame you can manipulate
myTable <- table(df2$node, df2$gender) %>%
as.data.frame.matrix %>%
mutate(node = row.names(.))
> myTable
neuter non.neuter rest node
1 0 4 2 A-bom
2 3 0 2 acroniem
3 3 2 1 act
> str(myTable)
'data.frame': 3 obs. of 4 variables:
$ neuter : int 0 3 3
$ non.neuter: int 4 0 2
$ rest : int 2 2 1
$ node : chr "A-bom" "acroniem" "act"
# And here is a more understandable way if you are not familiar with piping
# To learn more about forward piping : https://github.com/smbache/magrittr
myTable <- table(df2$node, df2$gender)
myTable2 <- as.data.frame.matrix(myTable)
myTable3 <- mutate(myTable2, node = row.names(myTable2))
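The question also asked whether rest could be derived by subtracting the neuter and non-neuter counts from the number of rows per node. Here is a minimal base R sketch of that idea. It rebuilds the example data so it runs on its own; the intermediate names (n.neuter, n.nonneuter, n.total) are only illustrative, and freqDf is the name used in the question.

```r
# Example data from the question, rebuilt so the sketch is self-contained.
df <- data.frame(
  node = rep(c("A-bom", "acroniem", "act"), times = c(6, 5, 6)),
  precedingWord = c("de", "die", "de", "een", "n", "de",
                    "het", "t", "het", "n", "een",
                    "de", "het", "die", "dat", "t", "n"),
  stringsAsFactors = FALSE
)
neuter <- c("t", "het", "dat")
non.neuter <- c("de", "die")

# Summing a logical vector within each node counts the matches per node.
n.neuter    <- tapply(df$precedingWord %in% neuter, df$node, sum)
n.nonneuter <- tapply(df$precedingWord %in% non.neuter, df$node, sum)
# Total number of rows per node.
n.total     <- tapply(df$precedingWord, df$node, length)

freqDf <- data.frame(
  node      = names(n.total),
  neuter    = as.vector(n.neuter),
  nonNeuter = as.vector(n.nonneuter),
  # rest is whatever remains after the two known categories.
  rest      = as.vector(n.total - n.neuter - n.nonneuter),
  stringsAsFactors = FALSE
)
freqDf
```

Because rest is computed by subtraction, changing the neuter or non.neuter vectors updates it without any further edits.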
Upvotes: 1
Reputation: 4868
R usually doesn't require looping. It's designed to act on all elements of a data structure using vectors and the apply commands. In this case you don't need to use tapply because the table function already does what you want.
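To illustrate the point about vectorization, here is a small sketch with made-up data (the vectors words and nodes are hypothetical, not the asker's data frame). %in% tests every element at once, and table then counts per group, so no explicit loop is involved.

```r
# Hypothetical toy data, only to show the vectorized pattern.
words <- c("de", "het", "een", "dat", "die")
nodes <- c("act", "act", "act", "A-bom", "A-bom")

# One logical value per element, computed without a loop.
is.neuter <- words %in% c("t", "het", "dat")

# Cross-tabulate node against the logical flag.
counts <- table(nodes, is.neuter)
counts
```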
Julien's answer works for your example, but in the (probably unusual) case that no words of a given type are present, it will fail. For example, if you had no "neuter" words then "neuter" would be missing from the table instead of showing all zeroes as expected. To deal with this, you can treat word type as a factor.
Note that in the code below, I added a fourth type of word ("nonword") to demonstrate the zero-words case.
df <- as.data.frame(matrix(c(
  "A-bom", "de", "A-bom", "die", "A-bom", "de", "A-bom", "een", "A-bom", "n", "A-bom", "de",
  "acroniem", "het", "acroniem", "t", "acroniem", "het", "acroniem", "n", "acroniem", "een",
  "act", "de", "act", "het", "act", "die", "act", "dat", "act", "t", "act", "n"),
  byrow = TRUE, ncol = 2), stringsAsFactors = FALSE)
names(df)<-c("node", "precedingWord")
# dictionary of word types.
# I added a fourth type of word to demonstrate what happens
# if no words of a given type are present.
classes<-c("t"="neuter", "het"="neuter" ,"dat"="neuter", "de"="non-neuter", "die"="non-neuter", "blorble"="nonword")
# create class variable and initialize to "rest"
df$class<-"rest"
df$class<-ifelse(!is.na(classes[df$precedingWord]), classes[df$precedingWord], "rest")
# note fourth category, "nonword", is missing.
table(df$node, df$class)
# make sure any missing categories are still possible levels for class
df$class<-factor(df$class)
levels(df$class)<-c(levels(df$class), unique(classes))
#now non-represented categories are still there.
table(df$node, df$class)
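A possible variation on the same idea, as a sketch only: declare every category as a level when the factor is created, so table reports zero counts from the start. The data are rebuilt here so the block runs on its own; the helper name cls and the level order are my own choices, not part of the code above.

```r
# Example data from the question, rebuilt so the sketch is self-contained.
df <- data.frame(
  node = rep(c("A-bom", "acroniem", "act"), times = c(6, 5, 6)),
  precedingWord = c("de", "die", "de", "een", "n", "de",
                    "het", "t", "het", "n", "een",
                    "de", "het", "die", "dat", "t", "n"),
  stringsAsFactors = FALSE
)

# Same dictionary as above, including the unused "nonword" type.
classes <- c("t" = "neuter", "het" = "neuter", "dat" = "neuter",
             "de" = "non-neuter", "die" = "non-neuter",
             "blorble" = "nonword")

# Look up each word by name; words not in the dictionary give NA.
cls <- unname(classes[df$precedingWord])
cls[is.na(cls)] <- "rest"

# Fix the full set of levels up front so empty categories are kept.
df$class <- factor(cls, levels = c(unique(unname(classes)), "rest"))

tab <- table(df$node, df$class)
tab
```

The nonword column is present and contains only zeroes, because the level exists even though no row uses it.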
Upvotes: 1