Phrogz

Reputation: 303253

Normalize HTTP URI

I get URIs from Akamai's log files that include entries such as the following:

/foo/jim/jam
/foo/jim/jam?
/foo/./jim/jam
/foo/bar/../jim/jam
/foo/jim/jam?autho=<randomstring>&file=jam

I would like to normalize all of these to the same entry, under the rules:

I would have thought that the URI library for Ruby would cover this, but:

So, failing an official library, I find myself writing a regex-based solution.

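def normalize(path)
  result = path.dup  # work on a copy so the caller's string is untouched
  # Drop the autho and file parameters from the query string
  result.sub!(/(?<=\?).+$/) do |query|
    query.split('&').reject do |kv|
      %w[ autho file ].include?(kv[/^[^=]+/])
    end.join('&')
  end
  # Remove a dangling '?' left behind by an empty query
  result.sub!(/\?$/, '')
  # Collapse '/dir/..' and '/./' in the part before the query
  result.sub!(/^[^?]+/){ |prefix| prefix.gsub(%r{/[^/]+/\.\.},'').gsub('/./','/') }
  result  # sub! returns nil when nothing matched, so return the copy explicitly
end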

It happens to work for the test cases I've listed above, but with 450,000 paths to clean up I cannot hand check them all.
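One way to get some confidence without hand-checking is to run every path through the normalizer and flag any output that still looks wrong. The following is only a sketch: suspicious_paths is a hypothetical helper name, the checks are assumptions about what "wrong" means for these logs, and the normalizer is passed in as a block so it works with any implementation.

```ruby
# Hypothetical helper: returns the inputs whose normalized form still looks wrong.
# The normalizer under test is supplied as a block.
def suspicious_paths(paths)
  paths.select do |raw|
    out = yield(raw.dup)
    out.nil? ||                            # normalizer returned nothing
      out.match?(%r{/\.\.?(/|\z|\?)}) ||   # leftover '.' or '..' segment
      out.include?('//') ||                # doubled slash
      out.end_with?('?') ||                # dangling '?'
      out.match?(/[?&](autho|file)=/) ||   # parameter that should have been dropped
      yield(out.dup) != out                # normalizing twice changes the result
  end
end
```

Calling it as suspicious_paths(all_paths) { |p| normalize(p) }, where all_paths is whatever array holds the log entries, returns only the entries worth inspecting by hand. An empty result does not prove correctness; it only rules out the failure modes listed in the checks.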

Upvotes: 5

Views: 2772

Answers (2)

the Tin Man

Reputation: 160549

Something that is REALLY important, like, ESSENTIAL to remember, is that a URL/URI is a protocol, a host, a file-path to a resource, followed by options/parameters being passed to the resource being referenced. (For the pedantic, there are other, optional, things in there too but this is sufficient.)

We can extract the path from a URL by parsing it using the URI class, and using the path method. Once we have the path, we have either an absolute path or a relative path based on the root of the site. Dealing with absolute paths is easy:

require 'uri'

%w[
  /foo/jim/jam
  /foo/jim/jam?
  /foo/./jim/jam
  /foo/bar/../jim/jam
  /foo/jim/jam?autho=<randomstring>&file=jam
].each do |url|
  uri = URI.parse(url)
  path = uri.path
  puts File.absolute_path(path)
end
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam

Because the paths are file paths based on the root of the server, we can play games using Ruby's File.absolute_path method to normalize the '.' and '..' away and get a true absolute path. This will give a wrong result if there are more .. (parent directory) segments than directories in the chain, since the extra ones are silently discarded at the root, but you shouldn't find that in extracted paths since that would also break the server/browser ability to serve/request/receive resources.
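If you'd rather not involve the file system API at all, the same idea can be sketched with Pathname#cleanpath from the standard library, which only manipulates the string. This is a minimal sketch, not a definitive implementation: normalize_log_path and DROPPED_PARAMS are names made up for illustration, and the query is split by hand rather than parsed, on the assumption that the log entries are plain path-plus-query strings.

```ruby
require 'pathname'

# Assumption: these are the only query parameters that should be discarded.
DROPPED_PARAMS = %w[autho file].freeze

# Hypothetical helper: lexically cleans the path and filters the query string.
def normalize_log_path(raw)
  path, query = raw.split('?', 2)
  # cleanpath resolves '.' and '..' without touching the file system
  clean_path = Pathname.new(path).cleanpath.to_s
  kept = query.to_s.split('&').reject do |pair|
    pair.empty? || DROPPED_PARAMS.include?(pair.split('=', 2).first)
  end
  kept.empty? ? clean_path : "#{clean_path}?#{kept.join('&')}"
end

puts normalize_log_path('/foo/bar/../jim/jam?autho=x&file=jam')  # prints /foo/jim/jam
```

Note that cleanpath also strips trailing slashes and collapses repeated ones, so /foo/ and /foo come out as the same entry; whether that is desirable depends on how the server treats them.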

It gets a bit more "interesting" when dealing with relative paths. File is still our friend there, but that's a different question.

Upvotes: 3

infused

Reputation: 24337

The addressable gem will normalize these for you:

require 'addressable/uri'

# normalize relative paths
uri = Addressable::URI.parse('http://example.com/foo/bar/../jim/jam')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"

# removes trailing ?
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"

# leaves empty parameters alone
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?jim')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam?jim"

# remove specific query parameters
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?autho=<randomstring>&file=jam')
cleaned_query = uri.query_values
cleaned_query.delete('autho')
cleaned_query.delete('file')
uri.query_values = cleaned_query
uri.normalize.to_s #=> "http://example.com/foo/jim/jam"

Upvotes: 8
