Reputation: 401
I have a dataframe df
as seen below with two features, a departing city and an arrival city. Every two rows information is stored about a going and a return flight.
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
7 K L
8 K L
There is some inconsistency in the data where the same flight is repeated as it can be seen in the last two rows.
How is it possible to compare for every two rows the departure city of the first row and the arrival city of the second row, and keep the ones that are equal. The dataset is very big and of course a for-loop is not considered an option.
Thank you in advance.
Upvotes: 0
Views: 4359
Reputation: 13581
Try the following solution, if it works for you:
df[duplicated(paste0(df$Departure,df$Arrival))==F,]
Upvotes: 0
Reputation: 3875
You could also do it this way:
right = rep(df[c(T,F),"Arrival"]==df[c(F,T),"Departure"],each=2)
df[right,]
This returns:
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
Upvotes: 1
Reputation: 814
This answer doesn't look for unique records, it specifically checks if a row is a duplicate of the row before.
Adding a new column with a 1 if the row has repeated:
for(i in 2:length(df$Departure)){df$test[i]=ifelse(df$Departure[i] == df$Departure[i-1] & df$Arrival[i] == df$Arrival[i-1], 1,0)}
Loops can be slow though:
library(data.table)
df$test2 = ifelse(df$Departure == shift(df$Departure) & df$Arrival == shift(df$Arrival), 1,0)
Upvotes: 0
Reputation: 38500
Here is a method that compares the pairs of rows using head
and tail
to line them up.
# find Departures that match the Arrival in the next row
sames <- which(head(dat$Departure, -1) == tail(dat$Arrival, -1))
# keep pairs of rows that match, maintaining order with `sort`
dat[sort(unique(c(sames, (sames + 1)))),]
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
Note that the two variables have to be character vectors, not factor variables. you can coerce them to character using as.character
if necessary.
data
dat <-
structure(list(Departure = c("A", "B", "F", "G", "U", "V", "K",
"K"), Arrival = c("B", "A", "G", "F", "V", "U", "L", "L")), .Names = c("Departure",
"Arrival"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8"))
Upvotes: 3
Reputation: 312
So you just want unique flight paths? there are a number of ways to do this, I'd think the fastest would be with data.table, something like:
library(data.table)
df <- as.data.table(df)
uniqueDf <- unique(df)
you can also use the duplicated function, something like
df <- df[!duplicated(df), ]
should do nicely.
Upvotes: 2