Reputation: 967
I have a dataframe with a few character columns and a date column and a string column.
One of the columns is a list of cities and I'd like to know which cities show up the most in my dataset. I used table(dataframe$city)
, but it gave me a list of every city (including cities that show up just once or twice).
How do I filter the results of my city to show just the cities in the top quartile, based on the number of times they appear in the data?
here's example input :
id price city
1 $0.8 los angeles
2 $0.8 new york
3 $0.5 new york
4 $0.6 new york
5 $0.9 los angeles
6 $0.1 houston
7 $0.7 chicago
8 $0.8 new york
9 $0.7 new york
10 $0.0 new york
11 $0.5 new york
12 $0.1 new york
13 $0.9 new york
14 $0.3 los angeles
15 $0.9 los angeles
16 $0.9 los angeles
17 $0.8 los angeles
18 $0.5 miami
19 $0.9 boston
20 $1.0 newton
21 $0.2 san mateo
22 $0.3 milbrae
When I do table(dataframe$city)
, I get a list of every city and a count of how many times it appears. What if I just want a list of the cities that appear more than average (like new york and los angeles)?
Upvotes: 1
Views: 132
Reputation: 2431
For one factor, table will return an array (similar to a vector). So in order to 'filter' it, you will need to subset it with whatever conditions you specify.
tbl <- table(dataframe$city)
quants <- quantile(tbl)
tbl[tbl >= quants['75%']]
Edit:
OP changed the request from top quartile to 'above average'. That's even simpler:
tbl <- table(dataframe$city)
tbl[tbl >= mean(tbl)]
Upvotes: 2
Reputation: 364
Yet another example:
# some example data... pretend letters are cities
dat <- sample(letters,250,replace=T)
# take the full table
tab <- table(dat)
# Here's my inelegant solution:
ord <- sort(tab/sum(tab),decreasing=T)
len <- length(tab)
top25.percentile <- floor(len/4)
show.nms <- names(ord[1:top25.percentile])
tab[which(names(tab) %in% show.nms)]
Upvotes: 1
Reputation: 423
A dplyr option using dummy data...
library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.3.4 ✔ dplyr 0.7.4
#> ✔ tidyr 0.7.2 ✔ stringr 1.2.0
#> ✔ readr 1.1.1 ✔ forcats 0.2.0
#> Warning: package 'tidyr' was built under R version 3.4.2
#> Warning: package 'purrr' was built under R version 3.4.2
#> Warning: package 'dplyr' was built under R version 3.4.2
#> ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
city_data <- tibble(city = c("LA", "DC", "DC", "LA", "CL", "DC", "NY"), A = sample(1:10, 7), B = sample(1:10, 7))
city_data %>%
count(city) %>%
filter(., n> nrow(.)/4)
#> # A tibble: 2 x 2
#> city n
#> <chr> <int>
#> 1 DC 3
#> 2 LA 2
Upvotes: 1
Reputation: 23
Here's an example you can run:
dat <- sample(1:10, size = 200, replace = TRUE)
dat_table <- table(dat)
dat_table[dat_table >= quantile(dat_table, probs = 0.75 )]
Upvotes: 2
Reputation: 145775
Try this:
tab = table(dataframe$city)
# sort it
(sort_tab = sort(tab, decreasing = T))
# take only the top quarter
head(sort_tab, length(sort_tab) / 4)
Upvotes: 1