Reputation: 65
I'm using the gapminder dataset to practice some basic data analysis on the data frame. I want to create a subset of this data with only Argentina and New Zealand, in order to compare their values.
install.packages("gapminder")
library(gapminder)
data("gapminder")
> gapminder
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ... with 1,694 more rows
I'm subsetting the rows I want like so:
df <- subset(gapminder, country == "Argentina" | country == "New Zealand")
> df
# A tibble: 24 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Argentina Americas 1952 62.5 17876956 5911.
2 Argentina Americas 1957 64.4 19610538 6857.
3 Argentina Americas 1962 65.1 21283783 7133.
4 Argentina Americas 1967 65.6 22934225 8053.
5 Argentina Americas 1972 67.1 24779799 9443.
6 Argentina Americas 1977 68.5 26983828 10079.
7 Argentina Americas 1982 69.9 29341374 8998.
8 Argentina Americas 1987 70.8 31620918 9140.
9 Argentina Americas 1992 71.9 33958947 9308.
10 Argentina Americas 1997 73.3 36203463 10967.
# ... with 14 more rows
This seems to work, as you can see.
Now I would like to create a simple boxplot to quickly compare some values, but boxplot() and geom_boxplot() give me two different results:
boxplot(lifeExp ~ country, data = df)
This is the plot I want, but the x axis also includes all the other countries I did not select. They have no data, but their empty slots make the plot unreadable.
If I plot the same data with ggplot2 instead, it works perfectly:
library(ggplot2)
ggplot(data = df, mapping = aes(x = country, y = lifeExp)) + geom_boxplot()
Am I doing something wrong when defining the subset? The boxplot() result gives me the impression that the subset keeps every country but sets the values for the ones I don't want to NULL.
Upvotes: 2
Views: 90
Reputation: 76402
Start with the code posted in the question.
library(gapminder)
data("gapminder")
df <- subset(gapminder, country == "Argentina" | country == "New Zealand")
boxplot(lifeExp ~ country, df)
The plot shows space for all countries because country is a factor and subsetting keeps its original levels. Calling str() on df shows this:
str(df)
#tibble [24 × 6] (S3: tbl_df/tbl/data.frame)
# $ country : Factor w/ 142 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ year : int [1:24] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num [1:24] 62.5 64.4 65.1 65.6 67.1 ...
# $ pop : int [1:24] 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
# $ gdpPercap: num [1:24] 5911 6857 7133 8053 9443 ...
The factor country still has 142 levels, even though only two of them occur in df.
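The same behaviour can be reproduced without gapminder. The following is a minimal sketch on a made-up factor (the values are hypothetical stand-ins for the country column), showing that subsetting removes observations but not levels:

```r
# Hypothetical stand-in for gapminder's country column.
country <- factor(c("Argentina", "Argentina", "Chile", "New Zealand"))

# Keep only two countries; this filters the values, not the levels.
kept <- country[country %in% c("Argentina", "New Zealand")]

levels(kept)  # still lists all three original levels
table(kept)   # "Chile" is reported with a count of 0
```

The empty "Chile" entry in the table corresponds to the empty slots that boxplot() draws on the x axis.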
The solution is to drop the extra levels.
df2 <- df
df2$country <- droplevels(df2$country)
boxplot(lifeExp ~ country, df2)
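Two other ways to get the same result may be useful. The sketch below uses a small made-up data frame (toy and its values are hypothetical, not gapminder data): droplevels() can be applied to the whole data frame, which drops unused levels in every factor column, or the column can be rebuilt with factor().

```r
# Hypothetical miniature data set with made-up values.
toy <- data.frame(
  country = factor(c("Argentina", "Chile", "New Zealand", "Argentina")),
  lifeExp = c(60, 55, 70, 65)
)
toy_sub <- subset(toy, country %in% c("Argentina", "New Zealand"))

# Option 1: drop unused levels in all factor columns at once.
toy_all <- droplevels(toy_sub)

# Option 2: rebuild the factor so it only keeps the values present.
toy_sub$country <- factor(toy_sub$country)

nlevels(toy_all$country)
nlevels(toy_sub$country)
```

As for why the ggplot2 version looked fine: its discrete scales take a drop argument that defaults to TRUE, which would explain why unused levels are omitted from the axis there.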
Upvotes: 3