Reputation: 6032

How can I safely extract the URL in Emacs regex

I'm having a hell of a problem reliably extracting the URL from an HTTP header using regex. It doesn't help that the header sometimes arrives with a trailing ^M (carriage return) and sometimes without, and the ^M doesn't seem to match the whitespace class. The best I've managed so far is:

(re-search-forward "^x-url: .*/\\{2,3\\}\\(.*\\)" nil t)

But of course that also picks up the ^M if it exists, as well as the URL parameters, which I don't really need. To give you an example from my debugging:

x-url: http://wiki/mediawiki/index.php?title=Vsmux&action=edit&redlink=1
x-url: http://wiki/mediawiki/index.php?title=Vsmux&action=edit&redlink=1^M

What I really want in both cases is just the result:

wiki/mediawiki/index.php

Upvotes: 1

Answers (3)

stsquad

Reputation: 6032

For completeness I should probably add another solution I've tried, based on a discussion with @wvxvw about using a proper parser. The resulting elisp looks a bit like this:

(save-excursion
  (let* ((url-string (url-get-url-at-point (re-search-forward "^x-url: ")))
         (url (url-generic-parse-url url-string))
         ;; Quote the "?" so it is matched literally instead of relying on
         ;; how Emacs treats a postfix operator at the start of a regexp.
         (arg-split (string-match-p "\\?" (url-filename url))))
    (format "%s%s" (url-host url)
            (if arg-split
                (substring (url-filename url) 0 arg-split)
              (url-filename url)))))
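
As a rough sketch of the same parser-based idea working on a plain string instead of the buffer: the helper name `my-x-url-host-and-path` is made up for illustration, and it assumes the header line has already been pulled out as a string. The capture group stops at whitespace, so a trailing ^M never reaches the parser.

```elisp
(require 'url-parse)

;; Hypothetical helper (not part of the original answer).
;; HEADER-LINE is a single "x-url: ..." line, possibly ending in ^M (\r).
(defun my-x-url-host-and-path (header-line)
  "Return host plus path from HEADER-LINE, or nil if it is not an x-url header."
  ;; Capture everything after the header name up to the first whitespace,
  ;; which also excludes a trailing carriage return.
  (when (string-match "\\`x-url:[ \t]*\\([^ \t\r\n]+\\)" header-line)
    (let* ((url (url-generic-parse-url (match-string 1 header-line)))
           ;; `url-path-and-query' returns (PATH . QUERY); keep only PATH.
           (path (car (url-path-and-query url))))
      (concat (url-host url) path))))

(my-x-url-host-and-path
 "x-url: http://wiki/mediawiki/index.php?title=Vsmux&action=edit&redlink=1\r")
```

With both sample headers from the question this should give wiki/mediawiki/index.php, with or without the trailing ^M.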

Upvotes: 2

user797257

Reputation:

This looks horrible, but I'm not responsible for how it looks; the people who invented this idiotic standard are. It should follow the standard (the old version, which didn't include Unicode characters and their translation) very closely:

"^x-url:\\s-*\\(\\w\\|\\+\\|-\\)+://\\(\\w\\|\\-\\)+\\(\\.\\w+\\)?\\(\\/\\(%[0-9a-fA-F]\\{2\\}\\|[~\\.A-Za-z_+-]*\\)*\\)*"

This holds unless some "helpful" program has already translated the percent-encoded URI components back into their original non-encoded form.

Also, there are some technical limits on how long the parts of the URL may be; I'm not going to try to implement those.

Also, it assumes that an authentication scheme, like the one used in basic authentication, is never used. Otherwise it would be a whole lot easier to do it without regular expressions.

Upvotes: 3

jtahlborn

Reputation: 53694

How about something like this (it assumes all URLs will have "://" in them):

(re-search-forward "^x-url: [^:]*://\\([^?\r\n]+\\).*?$")
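
To show how the captured group might be read back, here is a minimal sketch. The helper name `my-extract-x-url` is hypothetical, it runs the search in a temporary buffer rather than the real header buffer, and it drops the trailing `.*?$` since only group 1 is needed. The `nil t` arguments make the search return nil instead of signalling an error when no header is found.

```elisp
;; Hypothetical helper (not from the original answer): return the host and
;; path of the first x-url header in STRING, or nil if there is none.
(defun my-extract-x-url (string)
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    ;; [^?\r\n]+ stops at the query string, a carriage return (^M),
    ;; or the end of the line, whichever comes first.
    (when (re-search-forward "^x-url: [^:]*://\\([^?\r\n]+\\)" nil t)
      (match-string 1))))

(my-extract-x-url
 "x-url: http://wiki/mediawiki/index.php?title=Vsmux&action=edit&redlink=1\r\n")
```

Because the character class excludes \r explicitly, the ^M case needs no separate handling.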

Upvotes: 2
