Sebastian
Sebastian

Reputation: 967

How do I filter the results in table()

I have a dataframe with a few character columns and a date column and a string column.

One of the columns is a list of cities and I'd like to know which cities show up the most in my dataset. I used table(dataframe$city), but it gave me a list of every city (including cities that show up just once or twice).

How do I filter the results of my city to show just the cities in the top quartile, based on the number of times they appear in the data?

here's example input :

id  price   city
1   $0.8    los angeles
2   $0.8    new york
3   $0.5    new york
4   $0.6    new york
5   $0.9    los angeles
6   $0.1    houston
7   $0.7    chicago
8   $0.8    new york
9   $0.7    new york
10  $0.0    new york
11  $0.5    new york
12  $0.1    new york
13  $0.9    new york
14  $0.3    los angeles
15  $0.9    los angeles
16  $0.9    los angeles
17  $0.8    los angeles
18  $0.5    miami
19  $0.9    boston
20  $1.0    newton
21  $0.2    san mateo
22  $0.3    milbrae

When I do table(dataframe$city), I get a list of every city and a count of how many times it appears. What if I just want a list of the cities that appear more than average (like new york and los angeles)?

Upvotes: 1

Views: 132

Answers (5)

David Klotz
David Klotz

Reputation: 2431

For one factor, table will return an array (similar to a vector). So in order to 'filter' it, you will need to subset it with whatever conditions you specify.

tbl <- table(dataframe$city)
quants <- quantile(tbl)

tbl[tbl >= quants['75%']]

Edit:

OP changed the request from top quartile to 'above average'. That's even simpler:

tbl <- table(dataframe$city)
tbl[tbl >= mean(tbl)]

Upvotes: 2

Nate
Nate

Reputation: 364

Yet another example:

# some example data... pretend letters are cities
dat <- sample(letters,250,replace=T)

# take the full table
tab <- table(dat)

# Here's my inelegant solution:
ord <- sort(tab/sum(tab),decreasing=T)
len <- length(tab)
top25.percentile <- floor(len/4)
show.nms <- names(ord[1:top25.percentile])
tab[which(names(tab) %in% show.nms)]

Upvotes: 1

Ryan
Ryan

Reputation: 423

A dplyr option using dummy data...


library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
#> ✔ tibble  1.3.4     ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2     ✔ stringr 1.2.0
#> ✔ readr   1.1.1     ✔ forcats 0.2.0
#> Warning: package 'tidyr' was built under R version 3.4.2
#> Warning: package 'purrr' was built under R version 3.4.2
#> Warning: package 'dplyr' was built under R version 3.4.2
#> ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

city_data <- tibble(city = c("LA", "DC", "DC", "LA", "CL", "DC", "NY"), A = sample(1:10, 7), B = sample(1:10, 7))

city_data %>%
  count(city) %>%
  filter(., n> nrow(.)/4)
#> # A tibble: 2 x 2
#>    city     n
#>   <chr> <int>
#> 1    DC     3
#> 2    LA     2

Upvotes: 1

lukembrowne
lukembrowne

Reputation: 23

Here's an example you can run:

dat <- sample(1:10, size = 200, replace = TRUE)

dat_table <- table(dat)

dat_table[dat_table >= quantile(dat_table, probs = 0.75 )]

Upvotes: 2

Gregor Thomas
Gregor Thomas

Reputation: 145775

Try this:

tab = table(dataframe$city)
# sort it
(sort_tab = sort(tab, decreasing = T))
# take only the top quarter
head(sort_tab, length(sort_tab) / 4)

Upvotes: 1

Related Questions