Reputation: 303253
I get URIs from Akamai's log files that include entries such as the following:
/foo/jim/jam
/foo/jim/jam?
/foo/./jim/jam
/foo/bar/../jim/jam
/foo/jim/jam?autho=<randomstring>&file=jam
I would like to normalize all of these to the same entry, under the rules:
- If there is a query string, remove the autho and file parameters from it.
- If the query string is then empty, remove the trailing ? too.
- Path segments of ./ should be removed.
- Path segments of <fulldir>/../ should be removed.

I would have thought that the URI library for Ruby would cover this, but:
- It does not remove a trailing ? if the query string is emptied.
  URI.parse('/foo?jim').tap{ |u| u.query='' }.to_s #=> "/foo?"
- Its normalize method does not clean up . or .. in the path.

So, failing an official library, I find myself writing a regex-based solution.
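def normalize(path)
  result = path.dup  # work on a copy so the caller's string is left untouched
  # Drop the autho and file parameters from the query string, if there is one.
  result.sub!(/(?<=\?).+$/) do |query|
    query.split('&').reject do |kv|
      %w[ autho file ].include?(kv[/^[^=]+/])
    end.join('&')
  end
  result.sub!(/\?$/, '')  # remove a trailing ? left behind by an empty query
  # Collapse <dir>/../ and /./ in the part before any query string.
  result.sub!(/^[^?]+/) { |p| p.gsub(%r{[^/]+/\.\./}, '').gsub('/./', '/') }
  result
end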
It happens to work for the test cases I've listed above, but with 450,000 paths to clean up I cannot hand check them all.
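Since hand-checking is out, one option is to check the output against the rules mechanically. The following is only a sketch: normalization_problems is a hypothetical helper name, and the three checks are my reading of the rules listed above, not an exhaustive validator.

```ruby
# Hypothetical invariant check for an already-normalized path.
# Returns a list of rule violations; an empty list means none were found.
def normalization_problems(normalized)
  path, query = normalized.split('?', 2)
  problems = []
  # Rule: an empty query string must not leave a trailing ? behind.
  problems << 'trailing ?' if normalized.end_with?('?')
  # Rule: no . or .. segments may survive in the path part.
  if path.to_s.split('/').any? { |seg| seg == '.' || seg == '..' }
    problems << 'dot segment left in path'
  end
  # Rule: the autho and file parameters must be gone from the query.
  if query.to_s.split('&').any? { |kv| %w[autho file].include?(kv[/\A[^=]*/]) }
    problems << 'autho or file still in query'
  end
  problems
end
```

Running every normalized path through this and printing only the ones with a non-empty result reduces the 450,000 entries to the handful that need a human look.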
Upvotes: 5
Views: 2772
Reputation: 160549
The essential thing to remember is that a URL/URI consists of a protocol, a host, and a file path to a resource, followed by options/parameters being passed to that resource. (For the pedantic, there are other, optional, parts in there too, but this is sufficient.)
We can extract the path from a URL by parsing it with the URI class and calling its path method. Once we have the path, we have either an absolute path or a relative path based on the root of the site. Dealing with absolute paths is easy:
require 'uri'
%w[
  /foo/jim/jam
  /foo/jim/jam?
  /foo/./jim/jam
  /foo/bar/../jim/jam
  /foo/jim/jam?autho=<randomstring>&file=jam
].each do |url|
  uri = URI.parse(url)
  path = uri.path
  puts File.absolute_path(path)
end
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
Because the paths are file paths based on the root of the server, we can play games using Ruby's File.absolute_path method to normalize the '.' and '..' away and get a true absolute path. This will break if there are more .. (parent directory) segments than the chain of directories, but you shouldn't find that in extracted paths, since that would also break the server's and browser's ability to serve, request, and receive resources.
It gets a bit more "interesting" when dealing with relative paths. File is still our friend there, but that's a different question.
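To apply the same path cleanup together with the query rules from the question, something along these lines might work. It is a sketch under assumptions: normalize_log_path and DROPPED_PARAMS are made-up names, the input is assumed to be a non-empty absolute path as in the log samples, and it swaps File.absolute_path for Pathname#cleanpath, which cleans the path lexically without consulting the filesystem.

```ruby
require 'pathname'

# Hypothetical constant: the query parameters the question wants removed.
DROPPED_PARAMS = %w[autho file].freeze

# Hypothetical helper: normalizes one absolute path taken from a log line.
def normalize_log_path(raw)
  # Split by hand so placeholder characters in the query never reach a URI parser.
  path, query = raw.split('?', 2)
  # Lexically resolve . and .. segments.
  clean_path = Pathname.new(path).cleanpath.to_s
  # Keep every key=value pair except the unwanted parameters.
  kept = query.to_s.split('&').reject do |kv|
    kv.empty? || DROPPED_PARAMS.include?(kv[/\A[^=]*/])
  end
  # Only emit a ? when something is left in the query string.
  kept.empty? ? clean_path : "#{clean_path}?#{kept.join('&')}"
end

puts normalize_log_path('/foo/bar/../jim/jam?autho=<randomstring>&file=jam')
# >> /foo/jim/jam
```

One design caveat: cleanpath also drops a trailing slash, so /foo/jim/ and /foo/jim would collapse to the same entry. Whether that is acceptable depends on how the log entries are meant to be grouped.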
Upvotes: 3
Reputation: 24337
The addressable gem will normalize these for you:
require 'addressable/uri'
# normalize relative paths
uri = Addressable::URI.parse('http://example.com/foo/bar/../jim/jam')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"
# removes trailing ?
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"
# leaves empty parameters alone
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?jim')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam?jim"
# remove specific query parameters
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?autho=<randomstring>&file=jam')
cleaned_query = uri.query_values
cleaned_query.delete('autho')
cleaned_query.delete('file')
uri.query_values = cleaned_query
uri.normalize.to_s #=> "http://example.com/foo/jim/jam"
Upvotes: 8