supersambo

Reputation: 811

R Frequency table containing 0

I'm working on a data.frame with about 700,000 rows. It contains the ids of status updates and the corresponding usernames from Twitter. I just want to know how many different users are in there and how many times each of them has tweeted. I thought this was a very simple task using tables, but now I've noticed that I'm getting different results.

Recently I did it by converting the column to character, like this:

>freqs <- as.data.frame(table(as.character(w_dup$from_user)))
>nrow(freqs)
[1] 239678

Two months ago I did it like this:

>freqs <- as.data.frame(table(w_dup$from_user))
>nrow(freqs)
[1] 253594

I noticed that with the second approach the data frame contains usernames with a frequency of 0. How can that be? If a username is in the dataset, it must occur at least once.

?table didn't help me, nor was I able to reproduce this issue on smaller datasets.

What am I doing wrong? Or am I misunderstanding the use of tables?

Upvotes: 4

Views: 5643

Answers (1)

Julius Vainora

Reputation: 48241

The problem here is the type of the column: it is a factor, and a factor keeps all of its levels when you subset the data frame:

# Full data frame (since R 4.0.0, strings become factors only with stringsAsFactors = TRUE)
(df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE))
  x y
1 a 1
2 b 2
3 c 3
# Its structure - all three levels as it should be
str(df)
'data.frame':   3 obs. of  2 variables:
 $ x: Factor w/ 3 levels "a","b","c": 1 2 3
 $ y: int  1 2 3
# A smaller data frame
(newDf <- df[1:2, ])
  x y
1 a 1
2 b 2
# But the same three levels
str(newDf)
'data.frame':   2 obs. of  2 variables:
 $ x: Factor w/ 3 levels "a","b","c": 1 2
 $ y: int  1 2

So the first column is a factor. In this case:

table(newDf$x)

a b c 
1 1 0 

all the levels ("a","b","c") are counted, including "c", which no longer occurs in the data. And here

table(as.character(newDf$x))

a b 
1 1 

the values are not factors anymore, so only the values actually present are counted.
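
If you would rather keep the column as a factor, one option is to drop the unused levels before tabulating, or to filter out the zero counts afterwards. This is only a minimal sketch on the toy data from above; the names cleaned, counts, nonzero and n_distinct are made up for illustration, and on the real data you would use w_dup$from_user instead of newDf$x:

```r
# Recreate the subset from above; stringsAsFactors = TRUE is needed on R >= 4.0.0
df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE)
newDf <- df[1:2, ]

# Option 1: droplevels() removes levels that no longer occur in the data
cleaned <- droplevels(newDf)
counts <- table(cleaned$x)

# Option 2: tabulate the factor as is, then keep only rows with a nonzero count
freqs <- as.data.frame(table(newDf$x))
nonzero <- freqs[freqs$Freq > 0, ]

# Number of distinct values that are actually present
n_distinct <- length(unique(as.character(newDf$x)))
```

Both options should give the same number of rows as the as.character() version, since they all count only the values that are really in the subset.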

Upvotes: 4
