Sal

Reputation: 117

Most efficient way to remove duplicated rows in a data frame

I have a very large data frame: more than 6 million rows and 28 variables of mixed types (numeric, factor, character). I need to remove the duplicated rows. However, the only reliable way to identify true duplicates is to compare a long character variable (approximately 1,000 to 2,000 characters per observation). I could simply use the standard duplicated() function, but I am not sure it is the most time-efficient solution.

Is there a function or package that does this job efficiently? Thank you in advance for any suggestions.

structure(list(city = c("New York", "New York", "New York", "Brussels", 
"London", "Arlington"), prodCategory = structure(c(1L, 1L, 1L, 
1L, 1L, 1L), .Label = "4", class = "factor"), date = structure(c(16351, 
16352, 16351, 16353, 16354, 16355), class = "Date"), userID = c("ABCD", 
"XYZZ", "ABCD", "ABCD", "SDFG", "WEDGD"), review = c("in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat", 
"this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.", 
"in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat", 
"Each time I go to Brussels I stop by this typical brasserie located in the historical heart of Brussels downtown at a walking distance from almost every interesting place. Food is great and the menu is really rich and diversified service is sharp and fast and pricing very reasonable. Dont miss the typical chocolate cake. Actually I should write dont miss... everything included the rich list of Belgian beers", 
"That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop", 
"Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat"
)), .Names = c("city", "prodCategory", "date", "userID", "review"
), row.names = c(NA, -6L), class = "data.frame")
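
As a baseline, the duplicated() check mentioned in the question can be restricted to the review column alone, so the other columns are never compared. A minimal sketch, assuming a made-up toy data frame (the column names follow the sample, the values are illustrative only):

```r
# Toy stand-in for the real data; the values are made up for illustration.
df <- data.frame(
  userID = c("ABCD", "XYZZ", "ABCD", "SDFG"),
  review = c("great pastrami", "fine sea food", "great pastrami", "good pub food"),
  stringsAsFactors = FALSE
)

# duplicated() on a single vector flags every repeat after the first
# occurrence, so negating it keeps the first copy of each review.
keep <- !duplicated(df$review)
res  <- df[keep, ]

nrow(res)
```

Only one character vector is compared here, which is usually cheaper than calling duplicated() on the whole data frame, but whether it is fast enough for 6 million long strings would have to be timed on the real data.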

Upvotes: 1

Views: 144

Answers (2)

daniel

Reputation: 1246

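An alternative, though not necessarily a more efficient one, is to count the distinct prodCategory/city combinations per userID and keep only the users that have exactly one. Note that this drops every row of a user with more than one combination (all three ABCD rows in the sample), rather than keeping the first copy of each duplicate: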

df <- structure(list(city = c("New York", "New York", "New York", "Brussels", 
"London", "Arlington"), prodCategory = structure(c(1L, 1L, 1L, 
1L, 1L, 1L), .Label = "4", class = "factor"), date = structure(c(16351, 
16352, 16351, 16353, 16354, 16355), class = "Date"), userID = c("ABCD", 
"XYZZ", "ABCD", "ABCD", "SDFG", "WEDGD"), review = c("in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat", 
"this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.", 
"in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat", 
"Each time I go to Brussels I stop by this typical brasserie located in the historical heart of Brussels downtown at a walking distance from almost every interesting place. Food is great and the menu is really rich and diversified service is sharp and fast and pricing very reasonable. Dont miss the typical chocolate cake. Actually I should write dont miss... everything included the rich list of Belgian beers", 
"That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop", 
"Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat"
)), .Names = c("city", "prodCategory", "date", "userID", "review"
), row.names = c(NA, -6L), class = "data.frame")

# count distinct prodCategory/city pairs per userID; keep users with exactly one
df[with(df, ave(paste(prodCategory, city), userID, FUN=function(x) length(unique(x))))==1,]


       city prodCategory       date userID
2  New York            4 2014-10-09   XYZZ
5    London            4 2014-10-11   SDFG
6 Arlington            4 2014-10-12  WEDGD
                                                                                                                                                                                                          review
2            this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.
5             That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop
6 Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat

Upvotes: 1

akrun

Reputation: 886948

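Try data.table. Since version 1.9.8, unique() on a data.table compares all columns by default, so the key has to be passed explicitly through by:

library(data.table)
setkey(setDT(df), review)        # convert by reference and key (sort) on review
res <- unique(df, by = key(df))  # keep the first row for each distinct review
dim(res)
#[1] 5 5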
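
If sorting millions of long strings with setkey() is a concern, the key can be skipped: unique() on a data.table accepts a by argument naming the columns to compare. A minimal sketch on made-up toy data (the values are illustrative, not taken from the real data set):

```r
library(data.table)

# Toy data; the values are made up for illustration.
dt <- data.table(
  userID = c("ABCD", "XYZZ", "ABCD", "SDFG"),
  review = c("great pastrami", "fine sea food", "great pastrami", "good pub food")
)

# Deduplicate on review only; no key (and therefore no sort) is required.
res1 <- unique(dt, by = "review")

# Equivalent form: filter with duplicated() evaluated on the review column.
res2 <- dt[!duplicated(review)]

dim(res1)
```

Both forms keep the first row for each distinct review and leave the original row order untouched.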

Upvotes: 1
