Rob Syme

Reputation: 362

Does R use inexact matching when subsetting dataframe rows using square bracket notation?

Given a simple data frame:

df <-
  structure(
    list(
      lowercase = c("j", "t", "u"),
      uppercase = c("J", "T", "U")
    ),
    row.names = c("10", "20", "21"),
    class = "data.frame"
  )
> df
   lowercase uppercase
10         j         J
20         t         T
21         u         U

Selecting a row by a name that does not exist usually returns a row of NAs:

> df["2",]
   lowercase uppercase
NA      <NA>      <NA>

... but not always:

> df["1",]
   lowercase uppercase
10         j         J

Why does subsetting a data frame sometimes return a row even though no row name matches exactly?

I've tried this on both Linux (CentOS) and macOS, using R versions 3.1.2, 3.2.3, 3.6.0, and 4.0.3, with the same results.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   ~/tools/lib64/R/lib/libRblas.so
LAPACK: ~/tools/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.2

Upvotes: 1

Views: 48

Answers (1)

Rob Syme

Reputation: 362

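After a much closer reading of the Extract.data.frame manual page (?Extract.data.frame), I see that square bracket extraction partially matches row names. That explains both results in the question: "1" is a prefix of exactly one row name ("10"), so that row is returned, while "2" is a prefix of both "20" and "21", so the match is ambiguous and the result is a row of NAs.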
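The partial matching can be reproduced outside of a data frame with pmatch(). This is only a sketch, assuming the row lookup behaves like pmatch() with duplicates.ok = TRUE; rn is an illustrative copy of the question's row names, not something from the manual.

```r
# Row names from the question's data frame (copied here for illustration)
rn <- c("10", "20", "21")

# "1" is a prefix of exactly one name, so it resolves to that position
unique_hit <- pmatch("1", rn, duplicates.ok = TRUE)

# "2" is a prefix of both "20" and "21": ambiguous, so pmatch gives NA
ambiguous <- pmatch("2", rn, duplicates.ok = TRUE)

# An exact match is preferred over any partial match
exact_hit <- pmatch("20", rn, duplicates.ok = TRUE)

print(c(unique_hit, ambiguous, exact_hit))
```

An NA index is what produces the all-NA row seen for df["2",].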

The example given in the manual is:

> sw <- swiss[1:5, 1:4]

> sw
             Fertility Agriculture Examination Education
Courtelary        80.2        17.0          15        12
Delemont          83.1        45.1           6         9
Franches-Mnt      92.5        39.7           5         5
Moutier           85.8        36.5          12         7
Neuveville        76.9        43.5          17        15

> sw["C",]
           Fertility Agriculture Examination Education
Courtelary      80.2          17          15        12

The manual's recommendation for exact matching on row names is to use match:

> sw[match("C", row.names(sw)), ] 
   Fertility Agriculture Examination Education
NA        NA          NA          NA        NA

Bringing this back to the question posed above, the correct approach would be:

df[match("1", row.names(df)), ]
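For completeness, here is a small sketch of how the exact-match approach behaves on the question's data. The data frame is rebuilt with data.frame() so the snippet runs on its own, and the variable names (exact_na, exact_hit, exact_none) are made up for illustration. The last line, using %in%, is an alternative I am assuming you may prefer when a missing name should give zero rows rather than an NA row; it is not taken from the manual.

```r
# Rebuild the data frame from the question
df <- data.frame(
  lowercase = c("j", "t", "u"),
  uppercase = c("J", "T", "U"),
  row.names = c("10", "20", "21")
)

# match() only matches exactly: "1" is not a row name, so the index is NA
# and the result is a single row of NAs
exact_na <- df[match("1", row.names(df)), ]

# An existing row name is still found as usual
exact_hit <- df[match("10", row.names(df)), ]

# Logical indexing with %in% drops missing names instead of returning NAs
exact_none <- df[row.names(df) %in% "1", ]

print(exact_na)
print(exact_hit)
print(nrow(exact_none))
```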

Upvotes: 1
