lupl
lupl

Reputation: 944

How to extract XML processing instructions in Emacs Lisp?

I would like to extract the processing instructions (particularly xml-model) from an XML file; yet both (n)xml-parse-file as well as libxml-parse-xml-region do not recognize processing instructions.

Is there a clean way to extract processing instructions or do I have to regex search for PIs?

edit: Here is a first draft of the functionality I was looking for:

(cl-defun extract-processing-instructions (&rest processing-instructions)
  "Extracts all/only the specified xml processing instructions from the current buffer and returns them as a list of string."
  (interactive)
  (let ((pi-re
     (format "<\\?\\(%s\\).*\\?>" (string-join processing-instructions "\\|")))
    (result))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward pi-re nil t)
    (push (match-string 0) result)))
    (nreverse result)))

(cl-defun pi-str2sexp (pi-str)
  "Takes a processing instruction as a string and transforms it to a sexp-structure (in the style of xml-parse-*)."
  (let (sexp attr-alist)
    (save-match-data
      ;; get and push pi-element-name
      ;; (string-match "<\\?\\([[:alnum:]-]*\\)" pi-str)
      (string-match "<\\?\\([[:alnum:]-]*\\)" pi-str)
      (push (make-symbol (match-string 1 pi-str)) sexp)
      ;; construct attribute alist
      (while (string-match "\\([[:alnum:]-]*\\)=\"\\([^ ]*\\)\""
               pi-str (match-end 0))
    (push (cons (make-symbol (match-string 1 pi-str))
            (match-string 2 pi-str))
          attr-alist)))
    ;; finally: push attr alist and return sexp
    (push (nreverse attr-alist) sexp)
    (nreverse sexp)))

edit 2: Turns out advicing/generally building upon xml-parse-* in this matter (like suggested by @Tom Regner) is a huge pain. :(

The thing I came up with was a context manager, the idea was to use it to around-advice string-parse-tag-1 (which is at the heart of xml-parse-* (of course stand-alone use is also an option):

(cl-defmacro --replace-first-group (regex-replace-alist)
  `(save-excursion
     (dolist (expression ,regex-replace-alist)
       (goto-char (point-min))
       (replace-regexp (car expression) (cadr expression)))))

(cl-defmacro with-parsable-pi (buffer &body body)
  "Context manager that treats xml processing instructions in BUFFER as normal elements."
  (declare (indent defun))
  `(let ((old-buffer ,buffer))
     (with-temp-buffer
       (insert-buffer-substring old-buffer)
       (goto-char (point-min))
       (--replace-first-group '(("\\(\\?\\)>" "/>") ("<\\(\\?\\)" "<")))
       ,@body)))

This e.g. allows calls like

(with-parsable-pi (current-buffer)
  (xml-parse-tag-1))

so it is at least possible to get an element at a time; but since the XML exposed in the context manager isn't actually valid and xml-parse-* (rightfully) errors if invalid XML is encountered, it isn't possible to process more than one element at a time.

I was thinking of maybe introducing a pseudo root element or something, but the kludge spiral is ghastly enough as it is.

Another idea of course would be to run an xpath query to extract processing instructions. If there only was a solid xpath solution in Emacs Lisp..

Upvotes: 2

Views: 319

Answers (1)

lupl
lupl

Reputation: 944

Ok, I think I found a satisfactory solution: xmltok-forward-prolog!

So here is the code I came up with for extracting processing instructions:

(cl-defun filter-xmltok-prolog (&optional (buffer (current-buffer))
                     (filter-re "processing-instruction-.*"))
  "Filters the output of xmltok-forward-prolog (i.e. index 0 ('type') of each array) run in the context of BUFFER against FILTER-RE. Returns a list of vectors."
  (with-current-buffer buffer
    (save-excursion
      (goto-char (point-min))
      (let ((raw-prolog-data (xmltok-forward-prolog)))
    (seq-filter
     #'(lambda (x)
         (string-match filter-re (symbol-name (aref x 0))))
     raw-prolog-data)))))

(cl-defun --merge-pi-data (pi-data)
  "Meant to operate on data filtered with filter-xmltok-prolog against 'processing-instruction-.*'.
Merges processing-instruction-left/-right and returns a list of vectors holding the start/end coordinates of a processing instruction at index 1 and 2."
  (let ((left (car pi-data))
    (right (cadr pi-data)))
    (cond
     ((null pi-data) nil)
     (t (cons
     (vector 'processing-instruction
         (aref left 1) (aref right 2))
     (--merge-pi-data (cddr pi-data)))))))

;; test
(--merge-pi-data '([processing-instruction-left 40 51] [processing-instruction-right 52 126]))

(cl-defun pi-str2s-exp (pi-str)
  "Takes a processing instruction as a string and transforms it into a sexp structure (in the style of xml-parse-*)."
  (let (sexp attr-alist)
    (save-match-data
      ;; get and push pi-element-name
      (string-match "<\\?\\([[:alnum:]-]*\\)" pi-str)
      (push (make-symbol (match-string 1 pi-str)) sexp)
      ;; construct attribute alist
      (while (string-match "\\([[:alnum:]-]*\\)=\"\\([^ ]*\\)\""
               pi-str (match-end 0))
    (push (cons (make-symbol (match-string 1 pi-str))
            (match-string 2 pi-str))
          attr-alist)))
    ;; finally: push attr alist and return sexp
    (push (nreverse attr-alist) sexp)
    (nreverse sexp)))

(cl-defun get-processing-instructions (&optional (buffer (current-buffer)))
  "Extracts processing instructions from BUFFER and returns a list of sexp representations in the style of xml-parse-*."
  (save-excursion
    (mapcar #'pi-str2s-exp
     (mapcar #'(lambda (v)
           (buffer-substring (aref v 1) (aref v 2)))
       (--merge-pi-data (filter-xmltok-prolog buffer))))))


(cl-defun test/get-pis-from-file (file)
  (with-temp-buffer
    (insert-file-contents file)
    (get-processing-instructions)))

(test/get-pis-from-file "~/some/xml/file.xml")

I'm not at all an Emacs Lisp expert and this isn't at all tested thoroughly, but it works for now! :)

Upvotes: 1

Related Questions