Reputation: 4022

Explain ggplot2 warning: "Removed k rows containing missing values"

I get this warning when I am trying to generate a plot with ggplot.

After researching online for a while, I found many suggestions that my dataset contains either null values or missing data in general, which was not the case.

In this question the accepted answer says the following:

The warning means that some elements are removed because they fall out of the specified range

I was wondering what exactly this range refers to, and how one can manually increase it in order to avoid all warnings.

Upvotes: 125

Answers (7)

Mike M

Reputation: 1425

Another important variation on this, for histograms, appeared in 2019.

The computation was updated and now includes something (excuse me for maybe oversimplifying) like creating bins according to the limit=() specification on your binning axis. If this is wider than your data, it creates bars with 0 frequency, and it raises this same warning for those bars.

This is a completely different problem from clipping your data, but it produces the same message.

The ggplot2 developers suggested a few different ways of handling this on GitHub:
https://github.com/tidyverse/ggplot2/issues/3265
https://github.com/tidyverse/ggplot2/issues/4083

1a. If using `limit=()` inside an axis declaration, set your `limit=()` precisely to the range of your bars, even if your axis breaks are wider

or 1b. Add option `oob = scales::oob_keep` to the offending axis declaration

or 2. If using `xlim` or `ylim`, wrap it in `coord_cartesian()`
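
A minimal sketch of these options, assuming the built-in mtcars data and an arbitrary binwidth (neither comes from the linked issues). Whether the first plot actually warns depends on your ggplot2 version and on how the bins line up with the limits, so no output is shown.

```r
library(ggplot2)

# Baseline: limits wider than the data on the binned axis. This is the
# setup that can produce the warning for the empty edge bins.
p_limits <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(binwidth = 2) +
  scale_x_continuous(limits = c(0, 50))

# Option 1b: keep out-of-bounds values instead of censoring them to NA.
p_keep <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(binwidth = 2) +
  scale_x_continuous(limits = c(0, 50), oob = scales::oob_keep)

# Option 2: zoom with the coordinate system; the scale itself is untouched.
p_zoom <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(0, 50))
```

Printing p_keep or p_zoom draws the same histogram as p_limits; the difference is only in how values outside the limits are treated.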

Upvotes: 2

DiegoJArg

Reputation: 183

I know this question already has an answer, but this is another possible solution for you. As you didn't provide sample code, I can't know for sure.

If you just want to get rid of it, that implies to me that you are OK with the output. Then you can try the following:

Add na.rm = TRUE to your geom_something call, for example: geom_line(..., na.rm = TRUE)

This explicitly tells geom_line (and geom_path) that it is OK to remove NA values.
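
For example, a sketch with a hypothetical data frame df whose last y value is missing. The NA is placed at the end of the line on purpose: geom_line only warns about missing values at the start or end of a line, since NAs in the middle are drawn as a break.

```r
library(ggplot2)

# Hypothetical data with a missing y value in the last row
df <- data.frame(x = 1:5, y = c(2, 4, 3, 5, NA))

# Warns when drawn: the trailing row with the NA is removed
p_warn <- ggplot(df, aes(x, y)) + geom_line()

# Same plot, same row removed, but silently
p_quiet <- ggplot(df, aes(x, y)) + geom_line(na.rm = TRUE)
```

Both plots look identical; na.rm = TRUE only suppresses the message.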

Analyzing the warning message:

Warning of: Removed k rows containing missing values (geom_path)

This tells you mainly 3 things:

1. geom_path is being called by another geom_something, which is firing the warning.
2. It already removed k rows. So if the output is as desired, then you want those rows removed.
3. The reason for removal is that some values ARE missing (NA).

What the warning doesn't tell you is WHY those rows have missing (NA) values; that only you may know.

A common reason is setting limits on a scale, such as scale_x_datetime or scale_y_continuous.

This makes sense because an (X, Y) pair can only be drawn if neither value is NA.

When you set the X scale to larger values where there is no Y, or your Y data is NA, you get (X, Y) points where one of the two is NA.

You may want to set a larger scale for any number of reasons, but ggplot will always find that there isn't an associated Y value, and it makes sense to fire a warning instead of an error.

Have a nice day.

Upvotes: 6

eipi10

Reputation: 93761

The behavior you're seeing is due to how ggplot2 deals with data that are outside the axis ranges of the plot. scale_y_continuous (or, equivalently, ylim) excludes values outside the plot area when calculating statistics, summaries, or regression lines. coord_cartesian includes all values in these calculations, regardless of whether they are visible in the plot area. Here are some examples:

library(ggplot2)

# Set one point to a large hp value
d = mtcars
d$hp[d$hp==max(d$hp)] = 1000

All points are visible in this plot:

ggplot(d, aes(mpg, hp)) + 
  geom_point() +
  geom_smooth(method="lm") +
  labs(title="All points are visible; no warnings")
#> `geom_smooth()` using formula 'y ~ x'

In the plot below, one point with hp = 1000 is outside the y-axis range of the plot. Because we used scale_y_continuous to set the y-axis range, this point is not included in any other statistics or summary measures calculated by ggplot, such as the linear regression line calculated by geom_smooth. ggplot also provides warnings about the excluded point.

ggplot(d, aes(mpg, hp)) + 
  geom_point() +
  scale_y_continuous(limits=c(0,300)) +  # Change this to limits=c(0,1000) and the warning disappears
  geom_smooth(method="lm") +
  labs(title="scale_y_continuous: excluded point is not used for regression line")
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 1 rows containing non-finite values (stat_smooth).
#> Warning: Removed 1 rows containing missing values (geom_point).
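
The "missing values" in the second warning come from the scale's out-of-bounds handling: by default a continuous scale censors values outside its limits, replacing them with NA before the geom sees them. A small sketch using the scales package directly (the hp values here are made up for illustration):

```r
library(scales)

hp <- c(110, 245, 1000)   # illustrative values; 1000 is outside the limits
lims <- c(0, 300)

oob_censor(hp, range = lims)  # default behaviour: 1000 becomes NA
oob_squish(hp, range = lims)  # 1000 is squished to the upper limit, 300
oob_keep(hp, range = lims)    # values are left untouched
```

Any of these functions can be passed as the oob argument of scale_y_continuous to change what happens to out-of-range values.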

In the plot below, the point with hp = 1000 is still outside the y-axis range of the plot. However, because we used coord_cartesian, this point is nevertheless included in any statistics or summary measures that ggplot calculates, such as the linear regression line.

If you compare this and the previous plot, you can see that the linear regression line in this plot has a much steeper slope and wider confidence bands, because the point with hp=1000 is included when calculating the regression line, even though it's not visible in the plot.

ggplot(d, aes(mpg, hp)) + 
  geom_point() +
  coord_cartesian(ylim=c(0,300)) +
  geom_smooth(method="lm") +
  labs(title="coord_cartesian: excluded point is still used for regression line")
#> `geom_smooth()` using formula 'y ~ x'
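
The difference between the two regression lines can also be checked without plotting. This is a sketch that reproduces by hand what each approach effectively feeds to lm(); the data frame d is rebuilt so the block is self-contained, and the names fit_scale and fit_coord are only for illustration.

```r
# Rebuild the modified data: one hp value set to 1000
d <- mtcars
d$hp[d$hp == max(d$hp)] <- 1000

# What scale_y_continuous(limits = c(0, 300)) effectively fits:
# rows outside the limits are dropped first
fit_scale <- lm(hp ~ mpg, data = subset(d, hp >= 0 & hp <= 300))

# What coord_cartesian(ylim = c(0, 300)) effectively fits: all rows
fit_coord <- lm(hp ~ mpg, data = d)

coef(fit_scale)["mpg"]
coef(fit_coord)["mpg"]  # more negative: the hp = 1000 point pulls the line
```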

Upvotes: 118

ali

Reputation: 11

Another reason for that is the existence of NAs. Suppose your array name is arr. You can simply check if you have any NAs in your array by:

any(is.na(arr))

If the answer is TRUE, then you have to delete the NAs as below:

arr = arr[!is.na(arr)]

Even without any(is.na(arr)), you can simply run the above command: R will remove any NAs that might have existed and leave arr unchanged if there are none.
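
One caveat when removing NAs by position: if the array has no NAs, which() returns an empty index, and negating an empty index selects nothing. A short sketch (the vectors are made up) showing why the logical form !is.na(arr) is the safer one:

```r
arr <- c(3, 1, 4)            # no NAs at all

# which() finds nothing, so the negative index is empty and
# every element is dropped
arr[-which(is.na(arr))]      # numeric(0)

# Logical indexing keeps every non-NA element
arr[!is.na(arr)]             # 3 1 4

# With NAs present, logical indexing removes just those
arr2 <- c(3, NA, 4)
arr2[!is.na(arr2)]           # 3 4
```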

Upvotes: 0

davidnortes

Reputation: 922

Just for the sake of completing the answer given by eipi10.

I was facing the same problem without using either scale_y_continuous or coord_cartesian.

The conflict was coming from the x axis, where I defined limits = c(1, 30). It seems such limits do not provide enough space if you want to "dodge" your bars, so R still throws the warning

Removed 8 rows containing missing values (geom_bar)

Adjusting the limits of the x axis to limits = c(0, 31) solved the problem.

In conclusion, even if you are not putting limits on your y axis, check your x axis' behavior to ensure you have enough space.
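
A sketch of why the wider limits help, assuming bars at the integer positions 1 to 30 with ggplot2's default bar width of 0.9 (the data frame is made up, since the original plot was not posted). Each bar spans x - 0.45 to x + 0.45, so the first and last bars stick out of limits = c(1, 30) and their edges are censored to NA:

```r
library(ggplot2)

# Hypothetical data: one bar per integer x position
df <- data.frame(x = 1:30, y = rep(c(2, 5, 3), 10))

# Limits equal to the data range: the edge bars extend past them
p_narrow <- ggplot(df, aes(x, y)) +
  geom_col() +
  scale_x_continuous(limits = c(1, 30))

# Limits with room for half a bar width on each side
p_wide <- ggplot(df, aes(x, y)) +
  geom_col() +
  scale_x_continuous(limits = c(0, 31))

# Inspect the computed left bar edges: out-of-range edges show up as NA
anyNA(layer_data(p_narrow)$xmin)
anyNA(layer_data(p_wide)$xmin)
```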

Upvotes: 23

Rob

Reputation: 277

I ran into this as well, but in the case where I wanted to avoid the extra warning messages while keeping the range provided. An option is to subset the data prior to setting the range, so that the range can be kept however you like without triggering warnings.

library(ggplot2)

range(mtcars$hp)
#> [1]  52 335

# Setting limits with scale_y_continuous (or ylim) and subsetting accordingly
## avoids warning messages about removing data
ggplot(data= subset(mtcars, hp<=300 & hp >= 100), aes(mpg, hp)) + 
  geom_point() +
  scale_y_continuous(limits=c(100,300))

Upvotes: 0

Todd

Reputation: 588

Even if your data falls within your specified limits (e.g. c(0, 335)), adding a geom_jitter() statement could push some points outside those limits, producing the same warning message.

library(ggplot2)

range(mtcars$hp)
#> [1]  52 335

# No jitter -- no warning message
ggplot(mtcars, aes(mpg, hp)) + 
    geom_point() +
    scale_y_continuous(limits=c(0,335))


# Jitter is too large -- this generates the warning message
ggplot(mtcars, aes(mpg, hp)) + 
    geom_point() +
    geom_jitter(position = position_jitter(w = 0.2, h = 0.2)) +
    scale_y_continuous(limits=c(0,335))
#> Warning: Removed 1 rows containing missing values (geom_point).

Created on 2020-08-24 by the reprex package (v0.3.0)
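
If the jittered points should stay in the plot, one way is to zoom with coord_cartesian() instead of limiting the scale. A sketch, assuming a fixed seed for position_jitter() so the result is reproducible (the seed value is arbitrary):

```r
library(ggplot2)

# Zooming via the coordinate system does not censor jittered values,
# so no rows are removed even if a point lands just above 335
p <- ggplot(mtcars, aes(mpg, hp)) +
  geom_jitter(position = position_jitter(width = 0.2, height = 0.2, seed = 1)) +
  coord_cartesian(ylim = c(0, 335))

# Count the jittered y values that survive the build step
sum(!is.na(layer_data(p)$y))
```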

Upvotes: 1
