Reputation: 838
I am trying to bucket certain features into groups. The data.frame below (grouped) is my "key" (think Excel vlookup):
          Original  Grouped
1         Features Constant
2     PhoneService Constant
3    PhoneServices Constant
4       Surcharges Constant
5     CallingPlans Constant
6            Taxes Constant
7          LDUsage    Noise
8    RegionalUsage    Noise
9       LocalUsage    Noise
10       Late fees    Noise
11 SpecialServices    Noise
12         TFUsage    Noise
13       VoipUsage    Noise
14         CCUsage    Noise
15         Credits  Credits
16         OneTime  OneTime
I then reference my database, which has a column (BillSection) that takes its values from grouped$Original, and I want to map each record to the corresponding grouped$Grouped. I am using sapply to perform this lookup, then I cbind the resulting output onto my original data.frame.
grouper <- as.character(sapply(as.character(bill.data$BillSection[1:100]),  # for the first 100 records of bill.data
    function(x) grouped[grouped$Original == x, 2]))  # take the second column (Grouped) for the matching row of Original
cbind(bill.data[1:100, ], as.data.frame(grouper))
The above code works as expected, but it is slow when I apply it to my whole database, which exceeds 10,000,000 unique records. Is there a faster alternative to this method? I know I could use plyr, but I believe it is even slower than sapply. I was trying to figure it out with data.table but had no luck. Any suggestions would be helpful. I am also open to coding this in Python, which I am new to but have heard is much faster than R when working with large datasets, which I do very often; first I wanted to know whether R can do this fast enough to be useful.
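For context, this is roughly the keyed join I was attempting with data.table (a sketch using the objects above; I never got it working on my real data):

library(data.table)

# Convert both tables to data.tables and key them on the lookup columns
bill.dt    <- as.data.table(bill.data)
grouped.dt <- as.data.table(grouped)
setkey(bill.dt, BillSection)   # note: setkey sorts the table by reference
setkey(grouped.dt, Original)

# Keyed join: one row per row of bill.dt, with its Grouped label attached
# (the key column in the result is named Original, taken from grouped.dt)
result <- grouped.dt[bill.dt]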
Thanks!
Upvotes: 0
Views: 166
Reputation: 3955
I'm not sure I understand your question, but can you use merge()? i.e. something like...
merge(big.df, group.names.df, by.x='original.column.in.big.df',
      by.y='original', all.x=TRUE)
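With the column names from your question, that would be something like this (an untested sketch; note that merge() does not preserve row order, so I carry an index column along):

# Left join bill.data against the lookup table; unmatched rows get NA
bill.data$row.id <- seq_len(nrow(bill.data))   # remember the original order
out <- merge(bill.data, grouped,
             by.x = "BillSection", by.y = "Original",
             all.x = TRUE)
out <- out[order(out$row.id), ]                # restore the original order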
NB: plyr has a parallel option...
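The parallel switch needs a registered backend first; a minimal sketch, assuming the doMC package (fork-based, so not available on Windows):

library(plyr)
library(doMC)
registerDoMC(cores = 4)   # register a multicore backend for .parallel

# Same per-element lookup as in the question, spread across cores;
# this parallelizes the linear scan rather than replacing it
grouper <- unlist(llply(as.character(bill.data$BillSection),
                        function(x) as.character(grouped$Grouped[grouped$Original == x]),
                        .parallel = TRUE))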
Upvotes: 2