Reputation: 34763
EDIT: Revamped reflecting comments below
I've got some data on workers across time. They may work in more than one position in any given year; I want to subset the data to get workers who have at least one position with certain characteristics.
Here's my toy data:
set.seed(1643)
dt<-data.table(id=rep(1:1000,10),
area=sample(letters,1e4,replace=T),
position=sample(10,1e4,replace=T),
firm_type=sample(5,1e4,replace=T),
year=rep(2001:2010,each=1000),key="id")
I only want workers in area
d
,o
,w
,l
,e
at position
7
.
Unfortunately, the way that firm_type
was coded changed from 2005 onwards; prior to 2005, the relevant workers were all at firms with firm_type==1
. Thereafter, types 1
and 2
are acceptable.
I tried this look-up but it doesn't work:
dt[.(dt[firm_type %in% ifelse(year<2005,1,1:2)
&area %in% c("d","o","w","l","e")
&position==7,unique(id)])]
Specifically, the %in%
operator, as mentioned in the comments below, can't operate row-by-row, so we get (intermediate) output like:
> dt[firm_type %in% ifelse(year<2005,1,1:2)
+ &area %in% c("d","o","w","l","e")
+ &position==7,table(firm_type,year)]
year
firm_type 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 4 2 5 2 3 7 1 0 4 1
2 2 4 4 6 4 5 9 8 1 2
@Frank inspired this workaround:
dt[.(dt[ifelse(year<2005,firm_type==1,
firm_type %in% 1:2)
&area %in% c("d","o","w","l","e")
&position==7,unique(id)])]
I'm content with this, but I was encouraged to see if there's a better approach to this problem because ifelse
is not optimized.
Upvotes: 0
Views: 77
Reputation: 18743
Just use an "or" in addition to "and":
> dt[((firm_type == 1 ) | (firm_type ==2 & year>=2005))
+ &area %in% c("d","o","w","l","e")
+ &position==7,]
Upvotes: 0
Reputation: 66819
A faster way. You could defer the ifelse
until you have a smaller subset:
dt[ position==7L & area%in%c("d","o","w","l","e") & firm_type%in%1:2
][ifelse(year<2005,firm_type==1L,firm_type %in% 1:2),
unique(id)
]
Depending on how readable you find it, you could also do:
dt[ position==7L & area%in%c("d","o","w","l","e") & firm_type%in%1:2
][!(year < 2005 & firm_type==2L),
unique(id)
]
Regarding ifelse. ifelse(cond,yes,no)
is slow because it calculates all of yes
and no
if it needs either of them, as documented by @RicardoSaporta. Another idea -- (cond&yes)|((!cond)&no)
-- mentioned in an earlier iteration of the OP, has the same problem.
The verbose way. If your conditions are messier, you might want to make them explicit:
my_areas = c("d","o","w","l","e")
my_posns = 7L
my_yearfirms = data.table(year=unique(dt$year))[,.(
firm_type = if (year<2005) 1L else 1:2
),by=year]
merge(dt[position%in%my_posns & area%in%my_areas],my_yearfirms,by=c("year","firm_type"))[,
unique(id)
]
The final piece of code can be
This is what I would do unless efficiency was very important.
Upvotes: 1