geasa

Reputation: 65

Using geom_boxplot yields different result than base boxplot()

I'm using the gapminder dataset to practice some basic data analysis on the data frame. I want to create a subset of this data with only Argentina and New Zealand, in order to compare their values.

install.packages("gapminder")
library(gapminder)
data("gapminder")

> gapminder
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ... with 1,694 more rows

I subset the rows I want like so:

df <- subset(gapminder, country =="Argentina" | country == "New Zealand")

> df
# A tibble: 24 x 6
   country   continent  year lifeExp      pop gdpPercap
   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
 1 Argentina Americas   1952    62.5 17876956     5911.
 2 Argentina Americas   1957    64.4 19610538     6857.
 3 Argentina Americas   1962    65.1 21283783     7133.
 4 Argentina Americas   1967    65.6 22934225     8053.
 5 Argentina Americas   1972    67.1 24779799     9443.
 6 Argentina Americas   1977    68.5 26983828    10079.
 7 Argentina Americas   1982    69.9 29341374     8998.
 8 Argentina Americas   1987    70.8 31620918     9140.
 9 Argentina Americas   1992    71.9 33958947     9308.
10 Argentina Americas   1997    73.3 36203463    10967.
# ... with 14 more rows

This appears to work, as you can see.

Now I would like to create a simple boxplot to quickly analyze some values, but when I plot this with boxplot() and geom_boxplot I get two different results:

boxplot(lifeExp ~ country, data = df)


This is what I want, but the x axis also includes all the countries I did not select. They have no data, but they make the plot unreadable.

If I instead plot the same data with ggplot, it works perfectly:

library(ggplot2)
ggplot(data = df, mapping = aes(x = country, y = lifeExp)) + geom_boxplot()


Am I doing something wrong when defining the subset? The boxplot() output gives me the impression that the subset keeps every country but sets the values for the ones I don't want to NULL.

Upvotes: 2

Views: 90

Answers (1)

Rui Barradas

Reputation: 76402

Start with the code posted in the question.

library(gapminder)
data("gapminder")

df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
boxplot(lifeExp ~ country, df)

The plot shows space for all countries because country is a factor and subsetting keeps its original levels. str shows the structure of df:

str(df)
#tibble [24 × 6] (S3: tbl_df/tbl/data.frame)
# $ country  : Factor w/ 142 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ year     : int [1:24] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp  : num [1:24] 62.5 64.4 65.1 65.6 67.1 ...
# $ pop      : int [1:24] 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
# $ gdpPercap: num [1:24] 5911 6857 7133 8053 9443 ...

The factor country still has 142 levels, although only two of them occur in df.
The solution is to drop the unused levels.

df2 <- df
df2$country <- droplevels(df2$country)
boxplot(lifeExp ~ country, df2)
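
As for why ggplot looked right: ggplot2's discrete scales drop unused factor levels by default (drop = TRUE), so the empty countries never reach the axis, while base boxplot() reserves a slot for every level. The same behaviour can be reproduced without gapminder. The sketch below uses a made-up toy data frame (the names toy, kept and trimmed and all values are hypothetical) and calls droplevels() on the whole data frame, which drops unused levels from every factor column at once.

```r
# Toy stand-in for gapminder: three "countries", two rows each (made-up values)
toy <- data.frame(
  country = factor(c("A", "A", "B", "B", "C", "C")),
  lifeExp = c(60, 62, 70, 72, 50, 52)
)

# Subsetting removes the rows for "C" but keeps "C" as a factor level
kept <- subset(toy, country %in% c("A", "B"))
nlevels(kept$country)   # 3
table(kept$country)     # "C" is still listed, with a count of 0

# droplevels() on the data frame drops unused levels in all factor columns
trimmed <- droplevels(kept)
nlevels(trimmed$country)  # 2

# The base boxplot now draws one box per remaining level
boxplot(lifeExp ~ country, data = trimmed)
```

If the original levels are not needed anywhere else, subsetting and dropping in one step, droplevels(subset(...)), avoids keeping the intermediate copy.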


Upvotes: 3
