Bram Vanroy

Reputation: 28427

Looping and adding to a counter in R

I have a dataframe df that contains a couple of columns, but the only relevant ones are given below.

node    |   precedingWord
-------------------------
A-bom       de
A-bom       die
A-bom       de
A-bom       een
A-bom       n
A-bom       de
acroniem    het
acroniem    t
acroniem    het
acroniem    n
acroniem    een
act         de
act         het
act         die
act         dat
act         t
act         n

I'd like to use these values to count the precedingWords per node, split into subcategories: one column titled neuter, another non-neuter, and a last one rest. neuter would count all rows for which precedingWord is one of t, het, dat. non-neuter would count de and die, and rest would count everything that belongs to neither neuter nor non-neuter. (It would be nice if this could be dynamic, in other words if rest were defined as the complement of the values used for neuter and non-neuter, or simply computed by subtracting the neuter and non-neuter counts from the number of rows with that node.)

Example output (in a new dataframe, let's say freqDf) would look like this:

node    |   neuter   | nonNeuter   | rest
-----------------------------------------
A-bom       0          4             2
acroniem    3          0             2
act         3          2             1

To create freqDf$node, I can do this:

freqDf<- data.frame(node = unique(df$node), stringsAsFactors = FALSE)

But that's all I've got so far; I don't know how to continue. I figured I could do something like this, but unfortunately the ++ operator doesn't work as I had hoped.

freqDf$neuter[grep("dat|het|t", df$precedingWord, perl=TRUE)] <- ++
freqDf$nonNeuter[grep("de|die", df$precedingWord, perl=TRUE)] <- ++

e <- table(df$node)
freqDf$rest <- as.numeric(e - freqDf$neuter - freqDf$nonNeuter)

Also, this won't work for each node individually. I need some sort of loop that automatically runs for each different value in freqDf$node.

Upvotes: 2

Views: 1126

Answers (2)

Julien Navarre

Reputation: 7830

One way is to replace the values by their categories and then use the table function to generate the frequencies.

neuter <- c("t", "het", "dat")
non.neuter <- c("de", "die")

df$precedingWord[df$precedingWord %in% neuter] <- "neuter"
df$precedingWord[df$precedingWord %in% non.neuter] <- "non.neuter"
df$precedingWord[!df$precedingWord %in% c(neuter, non.neuter)] <- "rest"

table(df)

      precedingWord
  node       neuter non.neuter rest
  A-bom         0          4    2
  acroniem      3          0    2
  act           3          2    1

But I'm sure there is a better solution with the dplyr package for example.

EDIT: Maybe something like this (it doesn't overwrite your "precedingWord" column but adds a new "gender" one):

library(dplyr)
df %>%
  mutate(gender = ifelse(!precedingWord %in% c(neuter, non.neuter), "rest", 
                         ifelse(precedingWord %in% neuter, "neuter", "non.neuter"))) %>%
  count(node, gender)


Source: local data frame [7 x 3]
Groups: node

      node     gender n
1    A-bom non.neuter 4
2    A-bom       rest 2
3 acroniem     neuter 3
4 acroniem       rest 2
5      act     neuter 3
6      act non.neuter 2
7      act       rest 1

# And if you want the same output as in your question, you can use table
df2 <- mutate(df, gender = ifelse(!precedingWord %in% c(neuter, non.neuter), "rest", 
                       ifelse(precedingWord %in% neuter, "neuter", "non.neuter")))

table(df2$node, df2$gender)

           neuter non.neuter rest
  A-bom         0          4    2
  acroniem      3          0    2
  act           3          2    1

Edit: Convert the table to a data frame that you can manipulate

myTable <- table(df2$node, df2$gender) %>% 
  as.data.frame.matrix %>%
  mutate(node = row.names(.))

> myTable
  neuter non.neuter rest     node
1      0          4    2    A-bom
2      3          0    2 acroniem
3      3          2    1      act
> str(myTable)
'data.frame':   3 obs. of  4 variables:
 $ neuter    : int  0 3 3
 $ non.neuter: int  4 0 2
 $ rest      : int  2 2 1
 $ node      : chr  "A-bom" "acroniem" "act"

# And here is an easier-to-follow version if you are not familiar with piping
# To learn more about forward piping: https://github.com/smbache/magrittr
myTable <- table(df2$node, df2$gender)
myTable2 <- as.data.frame.matrix(myTable)
myTable3 <- mutate(myTable2, node = row.names(myTable2))
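
The question also asks whether rest could simply be computed by subtraction. Below is a minimal base-R sketch of that idea, not a definitive implementation. It rebuilds the example data so it runs on its own, and the helper names node, total, n_neuter and n_non are made up for illustration. It assumes the two word lists do not overlap.

```r
# Example data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  node = rep(c("A-bom", "acroniem", "act"), times = c(6, 5, 6)),
  precedingWord = c("de", "die", "de", "een", "n", "de",
                    "het", "t", "het", "n", "een",
                    "de", "het", "die", "dat", "t", "n"),
  stringsAsFactors = FALSE
)

neuter <- c("t", "het", "dat")
non.neuter <- c("de", "die")

# A factor keeps every node as a level, so nodes with zero hits still get a 0
node <- factor(df$node)

total    <- table(node)                                      # rows per node
n_neuter <- table(node[df$precedingWord %in% neuter])        # neuter rows per node
n_non    <- table(node[df$precedingWord %in% non.neuter])    # non-neuter rows per node

freqDf <- data.frame(
  node = levels(node),
  neuter = as.vector(n_neuter),
  nonNeuter = as.vector(n_non),
  stringsAsFactors = FALSE
)

# rest is whatever remains after removing the two named categories
freqDf$rest <- as.vector(total) - freqDf$neuter - freqDf$nonNeuter
```

Because rest is derived from the totals, adding or removing words in neuter or non.neuter changes rest automatically.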

Upvotes: 1

octern

Reputation: 4868

R usually doesn't require looping. It's designed to act on all elements of a data structure at once, using vectorized operations and the apply family of functions. In this case you don't even need tapply, because the table function already does what you want.
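
As a small illustration of acting on a whole vector at once, here is a sketch with a made-up vector called words (it is not part of the question's data). The %in% test is applied to every element without an explicit loop, and summing the resulting logical vector gives the count.

```r
# Hypothetical toy vector, for illustration only
words <- c("de", "het", "t", "n", "die")
neuter <- c("t", "het", "dat")

# One membership test per element, no loop needed
is_neuter <- words %in% neuter

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of neuter words
n_neuter <- sum(is_neuter)
```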

Julien's answer works for your example, but in the (probably unusual) case that no words of a given type are present, it will not give the result you expect. For example, if you had no "neuter" words, then "neuter" would be missing from the table instead of showing all zeroes. To deal with this, you can treat word type as a factor.

Note that in the code below, I added a fourth type of word ("nonword") to demonstrate the zero-words case.

df<-as.data.frame(matrix(c(
  "A-bom","de","A-bom","die","A-bom","de","A-bom","een","A-bom","n","A-bom","de",
  "acroniem","het","acroniem","t","acroniem","het","acroniem","n","acroniem","een",
  "act","de","act","het","act","die","act","dat","act","t","act","n"
), byrow=TRUE, ncol=2), stringsAsFactors=FALSE)
names(df)<-c("node", "precedingWord")

# dictionary of word types. 
# I added a fourth type of word to demonstrate what happens 
# if no words of a given type are present.
classes<-c("t"="neuter", "het"="neuter" ,"dat"="neuter", "de"="non-neuter", "die"="non-neuter", "blorble"="nonword")

# create class variable: look up each word in the dictionary;
# words that are not in the dictionary get "rest"
df$class<-ifelse(!is.na(classes[df$precedingWord]), classes[df$precedingWord], "rest")

# note fourth category, "nonword", is missing.
table(df$node, df$class)

# make sure any missing categories are still possible levels for class
# (union() avoids listing a level twice)
df$class<-factor(df$class)
levels(df$class)<-union(levels(df$class), unique(classes))

# now categories with no observations still appear, as all-zero columns.
table(df$node, df$class)
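
A more compact variant of the same idea, offered as a sketch rather than the only way to do it, declares all levels when the factor is created. The data is rebuilt here so the block runs on its own; the names cls and counts are made up for illustration, and "blorble" is the same placeholder word used above.

```r
# Example data, rebuilt so the sketch is self-contained
df <- data.frame(
  node = rep(c("A-bom", "acroniem", "act"), times = c(6, 5, 6)),
  precedingWord = c("de", "die", "de", "een", "n", "de",
                    "het", "t", "het", "n", "een",
                    "de", "het", "die", "dat", "t", "n"),
  stringsAsFactors = FALSE
)

# Dictionary of word types, including one type that never occurs in the data
classes <- c("t" = "neuter", "het" = "neuter", "dat" = "neuter",
             "de" = "non-neuter", "die" = "non-neuter", "blorble" = "nonword")

# Look up each word by name; words missing from the dictionary come back as NA
cls <- unname(classes[df$precedingWord])
cls[is.na(cls)] <- "rest"

# Declaring the levels up front keeps unobserved categories in the table
df$class <- factor(cls, levels = c(unique(classes), "rest"))

counts <- table(df$node, df$class)
```

Here counts has a "nonword" column filled with zeroes, because the level exists even though no row uses it.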

Upvotes: 1
