Reputation: 195
I have two simple data frames containing both the columns "word" and "n" for how often a certain word occurred. Here is an example:
df1 <- data.frame(word=c("beautiful","nice","like","good"),n=c(400,378,29,10))
df2 <- data.frame(word=c("beautiful","nice","like","good","wonderful","awesome","sad","happy"),n=c(6000,20,5,150,300,26,17,195))
Besides the words of df1
, df2
contains much more words so df1
is only a small subset of df2
.
I found the words, that are contained in both, df1
and df2
. Now I would like to subtract the word countings of df1
from df2
if the specific word is contained in df2
, meaning I would like to do the following:
df2$n - df1$n
df1$word
is contained in df2$word
I hope that my problem is clear.
I already found all the words from df1 that are also contained in df2
df1 %>% filter(df1$word %in% df2$word)
However, I am struggling with the subtracting command based on the condition that the words in df1 must be also in df2
and then only subtract df2$n - df1$n
Thank you for your help!
Upvotes: 2
Views: 2341
Reputation: 263311
Here's a vectorized base solution where Boolean multiplication is used to replace an if-then construct used in the for-lop from @Rob:
df2$n.adjusted <- df2$n - (df2$word %in% df1$word)* # zero if no match
df1$n[ match(df1$word, df2$word) ] # gets order correct
> df2
word n n.adjusted
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 300
6 awesome 26 26
7 sad 17 17
8 happy 195 195
Here's the example I used to test where the order of the df1 words was not the same as the order in df2 and the lengths were not an even multiple:
> df1 <-data.frame(word=c("nice","beautiful","like","good"),n=c(378,400,29,10))
> df2 <- data.frame(word=c("beautiful","nice","like","good","wonderful","awesome","sad"),n=c(6000,20,5,150,300,26,17))
>
> df1
word n
1 nice 378
2 beautiful 400
3 like 29
4 good 10
> df2
word n
1 beautiful 6000
2 nice 20
3 like 5
4 good 150
5 wonderful 300
6 awesome 26
7 sad 17
> df2$n.adjusted <- df2$n - (df2$word %in% df1$word)*df1$n[match(df1$word, df2$word)]
Warning message:
In (df2$word %in% df1$word) * df1$n[match(df1$word, df2$word)] :
longer object length is not a multiple of shorter object length
> df2
word n n.adjusted
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 300
6 awesome 26 26
7 sad 17 17
Upvotes: 2
Reputation: 277
Here is a quick solution using a for loop and the %in%
operator.
df2$diff <- NA
for (i in 1:nrow(df2)) {
if (df2$word[i] %in% df1$word[i]) {
df2$diff[i] <- df2$n[i] - df1$n[i]
}
}
df2
Output:
> df2
word n diff
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 NA
6 awesome 26 NA
7 sad 17 NA
8 happy 195 NA
Upvotes: 2
Reputation: 28675
require(dplyr)
df1 %>%
inner_join(df2, by = 'word') %>%
mutate(diff = n.y - n.x) %>%
select(word, diff)
Gives
word diff
1 beautiful 5600
2 nice -358
3 like -24
4 good 140
Upvotes: 2
Reputation: 61154
Using merge
:
> df.tmp <- merge(df1, df2, by="word", all=TRUE)
> df.tmp$result <- df.tmp$n.y - df.tmp$n.x
> df.tmp
word n.x n.y result
1 beautiful 400 6000 5600
2 good 10 150 140
3 like 29 5 -24
4 nice 378 20 -358
5 awesome NA 26 NA
6 happy NA 195 NA
7 sad NA 17 NA
8 wonderful NA 300 NA
If you only want matched words
> df.tmp <- merge(df1, df2, by="word")
> df.tmp$result <- df.tmp$n.y - df.tmp$n.x
> df.tmp
word n.x n.y result
1 beautiful 400 6000 5600
2 good 10 150 140
3 like 29 5 -24
4 nice 378 20 -358
Upvotes: 4