Reputation: 8946
I have a table set up as follows
id
origin
destination
carrier_id
so typical row could be,
100: London Manchester 366
Now each route goes both ways, so there shouldn't be a row like this
233: Manchester London 366
since that's essentially the same route (for my purposes anyway)
Unfortunately though, i have wound up with a handful of duplicates. I have over 50,000 routes made up of around 2000 point of origin (or destination, however you want to look at it) in the table. So i'm thinking looping through each point of origin to find duplicates would be insane.
So I don't even know where to start trying to figure out a query to identify them. Any ideas?
Upvotes: 1
Views: 305
Reputation: 3200
I think you just need a double join, the following will identify all the "duplicate" records joined together.
Here's an example.
Say SELECT * FROM FLIGHTS
yielded:
id origin destination carrierid
1 toronto quebec 1
2 quebec toronto 2
3 edmonton calgary 3
4 calgary edmonton 4
5 hull vancouver 5
6 vancouveredmonton 6
7 edmonton toronto 7
9 edmonton quebec 8
10 toronto edmonton 9
11 quebec edmonton 10
12 calgary lethbridge 11
So there's a bunch of duplicates (4 of the routes are duplicates of some other route).
select *
from flights t1 inner join flights t2 on t1.origin = t2.destination
AND t2.origin = t1.destination
would yield just the duplicates:
id origin destination carrierid id origin destination carrierid
1 toronto quebec 1 2 quebec toronto 2
2 quebec toronto 2 1 toronto quebec 1
3 edmonton calgary 3 4 calgary edmonton 4
4 calgary edmonton 4 3 edmonton calgary 3
7 edmonton toronto 7 10 toronto edmonton 9
9 edmonton quebec 8 11 quebec edmonton 10
10 toronto edmonton 9 7 edmonton toronto 7
11 quebec edmonton 10 9 edmonton quebec 8
At that point you just might delete all the ones that occurred 1st.
delete from flights
where id in (
select t1.id
from flights t1 inner join flights t2 on t1.origin = t2.destination
AND t2.origin = t1.destination
)
Good luck!
Upvotes: 3
Reputation: 258128
If you don't mind a little shell scripting, and if you can get a dump of the input in the form you've shown here... and here's my sample input:
100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
233: Manchester London 366
You might be able to do something like this:
$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | sort
CityA CityB 150:
CityA London 121:
CityA Manchester 144:
London Manchester 100:
London Manchester 233:
So that you at least have the pairs grouped together. Not sure what would be the best move from there.
$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | (sort; echo "") | awk '{ if (fst == $1 && snd == $2) { printf "%s%s", num, $3 } else { print fst, snd; fst = $1; snd = $2; num = $3} }' | grep "^[0-9]"
150:151:150:255:CityA CityB
100:233:London Manchester
where m.txt has these new contents:
100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
151: CityB CityA 90
233: Manchester London 366
255: CityA CityB 90
Perl probably would have been a better choice than awk, but here goes: First we sort the two city names and put the ID at the end of the string, which I did in the first section. Then we sort those to group pairs together, and we have to tack on an extra line for the awk script to finish up. Then, we loop over each line in the file. If we see a new pair of cities, we print the cities we previously saw, and we store the new cities and the new ID. If we see the same cities we saw last time, then we print out the ID of the previous line and the ID of this line. Finally, we grep only lines beginning with a number so that we discard non-duplicated pairs.
If a pair occurs more than twice, you'll get a duplicate ID, but that's not such a big deal.
Clear as mud?
Upvotes: 0
Reputation: 18506
Bummer! Off the top of my head (and in psuedo-sql):
select * from (
select id, concat(origin, '_', destination, '_', carrier_id) as key from ....
union
select id, concat(destination, '_', origin, '_', carrier_id) as key from ....
) having count(key) > 1;
For the records above, you'd end up with:
100, London_Manchester_366
100, Manchester_Longer_366
233 Manchester_London_366
233 London_Manchester_366
That's really, really hackish, and doesn't give you exactly what you're doing - it only narrows it down. Maybe it'll give you a starting point? Maybe it'll give someone else some ideas they can provide to help you too.
Upvotes: 0