Reputation: 79
I am working on a statistics project. I have a table of words and the frequency of each one in a text, and I want a sample that returns the words with the highest frequency.
Hello, good afternoon. I hope someone can help me.
I have a table with words and how often each one appears in a text.
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- c("10", "2", "5", "8", "2", "1")
table <- cbind.data.frame(word, freq)
# word freq
# 1 banana 10
# 2 watermelon 2
# 3 water 5
# 4 apple 8
# 5 blue 2
# 6 sky 1
sample(table$freq,2)
# [1] 2 5
What I want is:
# [1] 10 8
Upvotes: 0
Views: 141
Reputation: 160587
If you want to sample with probability weighted by your freq (converted to integer), then perhaps
sample(tb$freq, size = 2, prob = tb$freq)
Let's see how strongly this favors the words we expect to get. For demonstration, I'll sample the word based on its freq (since that makes more sense to me); you can move variables around as you see fit.
samps <- replicate(1000, sample(tb$word, size = 2, prob = tb$freq))
str(samps)
# chr [1:2, 1:1000] "water" "apple" "water" "banana" "watermelon" "banana" ...
sort(table(samps))
# samps
# sky watermelon blue water apple banana
# 93 151 166 370 572 648
The replicate call gives us a matrix (2 rows, one column per replication), so tabulating and sorting its values, we see that banana is drawn more often than any other word.
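The counts above change from run to run because the draws are random. As a minimal sketch, fixing a seed first makes the simulation repeatable, and dim() confirms the shape of the replicate result. The seed value 42 is an arbitrary assumption, and the data is re-created here only so the snippet stands alone.

```r
# Illustrative data, mirroring the Data section of this answer
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word, stringsAsFactors = FALSE)

set.seed(42)  # arbitrary seed (an assumption), only so reruns give the same draws
samps <- replicate(1000, sample(tb$word, size = 2, prob = tb$freq))

# One column per replication, one row per word drawn in that replication
dim(samps)
# [1]    2 1000
```

Since sample() draws without replacement by default, the two words in any one column are always different.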
We can see that the proportions are about right with
sort(table(samps)) / sum(table(samps))
# samps
# sky watermelon blue water apple banana
# 0.0465 0.0755 0.0830 0.1850 0.2860 0.3240
tb$pct <- tb$freq / sum(tb$freq)
tb <- tb[ order(tb$pct), ]
tb
# freq word pct
# 6 1 sky 0.03571429
# 2 2 watermelon 0.07142857
# 5 2 blue 0.07142857
# 3 5 water 0.17857143
# 4 8 apple 0.28571429
# 1 10 banana 0.35714286
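If the goal is not a random draw at all but simply the two most frequent entries (the 10 and 8 shown in the question), sampling may not be needed. A minimal sketch, assuming the same data as in this answer, orders the rows by decreasing freq and keeps the first two. The name top2 is only illustrative.

```r
# Illustrative data, mirroring the Data section of this answer
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word, stringsAsFactors = FALSE)

# Sort rows from highest to lowest frequency, then keep the first two
top2 <- head(tb[order(tb$freq, decreasing = TRUE), ], 2)

top2$freq
# [1] 10  8
top2$word  # the matching words: banana and apple
```

This always returns the same rows, unlike the weighted sample(), which only makes the frequent words more likely.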
Data
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word)
Upvotes: 2