Selecting rows with same result in different columns in R

Question

I would like to select in my dataframe (catch) only the rows for which my "tspp.name" variable is the same as my "elasmo.name" variable.

For example, row #74807 and #74809 in this case would be selected, but not row #74823 because the elasmo.name is "skate" and the tspp.name is "Northern shrimp".

I am sure there is an easy answer for this, but I have not found it yet. Any hints would be appreciated.

> catch[4:6,]
      gear tripID obsID sortie setID       date     time NAFO    lat   long      dur depth bodymesh
74807 GRL2 G00001     A      1    13 2000-01-04 13:40:00   2H 562550 594350 2.000000   377       80
74809 GRL2 G00001     A      1    14 2000-01-04 23:30:00   2H 562550 594350 2.166667   370       80
74823 GRL2 G00001     A      1    16 2000-01-05 07:45:00   2H 561450 593050 3.000000   408       80
      codendmesh mail.fil long.fil nbr.fil hook.shape hook.size hooks VTS tspp       tspp.name elasmo
74807         45       NA       NA      NA                   NA    NA 3.3 2211 Northern shrimp   2211
74809         45       NA       NA      NA                   NA    NA 3.2 2211 Northern shrimp   2211
74823         45       NA       NA      NA                   NA    NA 3.3 2211 Northern shrimp    211
          elasmo.name kept discard Tcatch     date.1 latitude longitude       EID
74807 Northern shrimp 2747      50   2797 2000-01-04 56.91667 -60.21667 G00001-13
74809 Northern shrimp 4919     100   5019 2000-01-04 56.91667 -60.21667 G00001-14
74823          Skates    0      50     50 2000-01-05 56.73333 -60.00000 G00001-16
                                 fgear
74807 Shrimp trawl (stern) with a grid
74809 Shrimp trawl (stern) with a grid
74823 Shrimp trawl (stern) with a grid

Tim P · Accepted Answer

I know what the problem is - you need to read in the data "as is", by adding the argument as.is=TRUE to the read.csv command (which you presumably used to load everything in). Without this, the strings get stored as factors, and all methods suggested above will fail (as you've discovered!)

Once you've read in the data correctly, you can use either

catch[which(catch$tspp.name == catch$elasmo.name),]

or

subset(catch, tspp.name == elasmo.name)

to obtain the matching rows - do not omit the which in the first one otherwise the code will fail when doing comparisons with NAs.

Below is a 30-second example using a small fabricated data set that illustrates all these points explicitly.

First, create a text file on disk that looks like this (I saved it as "F:/test.dat" but it can be saved anywhere)...

col1~col2
a~b
a~a
b~b
c~NA
NA~d
NA~NA

Let's load it in without converting factors to strings, just to see the methods proposed above fall over:

> dat=read.csv("F:/test.dat",sep="~")  # don't forget to check the filename

> dat[which(dat$col1==dat$col2),]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different

> dat[dat$col1==dat$col2,]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different

> subset(dat,col1==col2)
Error in Ops.factor(col1, col2) : level sets of factors are different

This is exactly the problem you were having. If you type dat$col1 and dat$col2 you'll see that the first has factor levels a b c while the second has factor levels a b d - hence the error messages.

Now let's do the same, but this time reading in the data "as is":

> dat=read.csv("F:/test.dat",sep="~",as.is=TRUE)  # note the as.is=TRUE

> dat[which(dat$col1==dat$col2),]
  col1 col2
2    a    a
3    b    b

> dat[dat$col1==dat$col2,]
     col1 col2
2       a    a
3       b    b
NA    
NA.1  
NA.2  

> subset(dat,col1==col2)
     col1 col2
2    a    a
3    b    b

As you can see, the first method (based on which) and the third method (based on subset) both give the right answer, while the second method gets confused by comparisons with NA. I would personally advocate the subset method as in my opinion it's the neatest.

A final note: There are other ways that you can get strings arising as factors in a data frame - and to avoid all of those headaches, always remember to include the argument stringsAsFactors = FALSE at the end whenever you create a data frame using data.frame. For instance, the correct way to create the object dat directly in R would be:

dat=data.frame(col1=c("a","a","b","c",NA,NA), col2=c("b","a","b",NA,"d",NA),
                         stringsAsFactors=FALSE)

Type dat$col1 and dat$col2 and you'll see they've been interpreted correctly. If you try it again but with the stringsAsFactors argument omitted (or set to TRUE), you'll see those darned factors appear (just like the dodgy first method of loading from disk).

In short, always remember as.is=TRUE and stringsAsFactors=FALSE, and learn how to use the subset command, and you won't go far wrong!

Hope this helps :)

Selecting rows with same result in different columns in R

Answers (1)

Related Questions