Reputation: 117
I have a very large data frame: more than 6 million rows and 28 variables of mixed types (numeric, factor, character). I need to remove the duplicated rows. However, the only way to identify actual duplicates is to run the check on a large character variable (approximately 1,000 to 2,000 characters per observation).
I could very well use the standard duplicated() function, but I am not sure it is the most time-efficient solution.
Is there a function or package that does the job efficiently? Thank you in advance for any suggestions.
structure(list(city = c("New York", "New York", "New York", "Brussels",
"London", "Arlington"), prodCategory = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = "4", class = "factor"), date = structure(c(16351,
16352, 16351, 16353, 16354, 16355), class = "Date"), userID = c("ABCD",
"XYZZ", "ABCD", "ABCD", "SDFG", "WEDGD"), review = c("in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat",
"this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.",
"in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat",
"Each time I go to Brussels I stop by this typical brasserie located in the historical heart of Brussels downtown at a walking distance from almost every interesting place. Food is great and the menu is really rich and diversified service is sharp and fast and pricing very reasonable. Dont miss the typical chocolate cake. Actually I should write dont miss... everything included the rich list of Belgian beers",
"That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop",
"Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat"
)), .Names = c("city", "prodCategory", "date", "userID", "review"
), row.names = c(NA, -6L), class = "data.frame")
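For reference, the duplicated() baseline mentioned above could look like the following minimal sketch. The data frame `toy` and its short strings are made up for illustration (they stand in for the real data and the long review column), and `deduped` is just a placeholder name.

```r
# Toy data frame standing in for the real one; 'review' plays the role
# of the long character column used to identify duplicates.
toy <- data.frame(
  userID = c("ABCD", "XYZZ", "ABCD", "SDFG"),
  review = c("great pastrami", "top quality sea food",
             "great pastrami", "great pub food"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each review,
# so negating it keeps the first row for every distinct text.
deduped <- toy[!duplicated(toy$review), ]
nrow(deduped)
```

Only the review column is passed to duplicated(), so the other 27 variables are never compared.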
Upvotes: 1
Views: 144
Reputation: 1246
An alternative, though not necessarily a more efficient one, is to count the distinct values per user and keep only the users that have a single one:
df <- structure(list(city = c("New York", "New York", "New York", "Brussels",
"London", "Arlington"), prodCategory = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = "4", class = "factor"), date = structure(c(16351,
16352, 16351, 16353, 16354, 16355), class = "Date"), userID = c("ABCD",
"XYZZ", "ABCD", "ABCD", "SDFG", "WEDGD"), review = c("in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat",
"this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.",
"in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat",
"Each time I go to Brussels I stop by this typical brasserie located in the historical heart of Brussels downtown at a walking distance from almost every interesting place. Food is great and the menu is really rich and diversified service is sharp and fast and pricing very reasonable. Dont miss the typical chocolate cake. Actually I should write dont miss... everything included the rich list of Belgian beers",
"That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop",
"Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat"
)), .Names = c("city", "prodCategory", "date", "userID", "review"
), row.names = c(NA, -6L), class = "data.frame")
# count the distinct (prodCategory, city) combinations per userID and keep the users with exactly one
df[with(df, ave(paste(prodCategory, city), userID, FUN=function(x) length(unique(x))))==1,]
city prodCategory date userID
2 New York 4 2014-10-09 XYZZ
5 London 4 2014-10-11 SDFG
6 Arlington 4 2014-10-12 WEDGD
review
2 this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.
5 That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop
6 Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat
Upvotes: 1
Reputation: 886948
Try
library(data.table)
setkey(setDT(df), review)
res <- unique(df, by = "review")  # since data.table 1.9.8, unique() compares all columns unless 'by' is given
dim(res)
#[1] 5 5
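Note that setkey() physically sorts the table by review. If the original row order matters, a keyless variant is possible; the sketch below assumes the data.table package is installed and uses a made-up table `toy` with short strings in place of the real reviews.

```r
library(data.table)

# Toy table standing in for the real data; 'review' is the column
# that identifies duplicates.
toy <- data.table(
  userID = c("ABCD", "XYZZ", "ABCD", "SDFG"),
  review = c("great pastrami", "top quality sea food",
             "great pastrami", "great pub food")
)

# No key is set, so rows are not reordered. by = "review" limits the
# comparison to the text column and keeps the first row per distinct review.
res <- unique(toy, by = "review")
nrow(res)
```

Whether this is faster than the keyed version on 6 million long strings would have to be measured on the real data, for example with system.time().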
Upvotes: 1