Reputation: 85

Common Lisp: Optimizing file parsing for minimum reads and memory allocations

I could use some help optimizing some Common Lisp code. I am attempting to query data out of a log file. Pulling the first 50 lines out of over 14.5k lines takes over a second. Extrapolating that out, it would take almost 5 minutes just to read the data from the log file. Additionally the first 50 lines with my currently implementation allocates ~50MB, when the entire file is only 14MB. Where I want to go with this is to perform 1 read through the data to parse it with the minimum number of memory allocations.

I know the performance hit I am seeing is due to my code. What I am having a hard time wrapping my brain around is how to refactor my code to minimize the issues I am seeing. I have tried accessing the string as a stream using WITH-INPUT-FROM-STRING and the performance didn't change noticeably.

This is an IIS log, so it will have a consistent structure. The first 2 fields are date and time, which I would like parsed into a number so I can constrain the range of data when needed. After that, most of the fields will be variable in size, but all are separated by a space.

With My Code: took 1,138,000 microseconds (1.138000 seconds) to run with 8 available CPU cores. During that period, 1,138,807 microseconds (1.138807 seconds) were spent in user mode 0 microseconds (0.000000 seconds) were spent in system mode 19,004 microseconds (0.019004 seconds) was spent in GC. 49,249,040 bytes of memory allocated.

Without My Code: took 64,000 microseconds (0.064000 seconds) to run with 8 available CPU cores. During that period, 62,401 microseconds (0.062401 seconds) were spent in user mode 0 microseconds (0.000000 seconds) were spent in system mode 834,512 bytes of memory allocated.

(defun read-date-time (hit)
  (let ((date-time (chronicity:parse (subseq hit 0 20))))
    (encode-universal-time (chronicity:sec-of date-time)
               (chronicity:minute-of date-time)
               (chronicity:hour-of date-time)
               (chronicity:day-of date-time)
               (chronicity:month-of date-time)
               (chronicity:year-of date-time))))

(defun parse-hit (hit)
  (unless (eq hit :eof)
    (cons (read-date-time hit)
          (split-sequence:split-sequence #\Space (subseq hit 20)))))


(time (gzip-stream:with-open-gzip-file (ins "C:\\temp\\test.log.gz") 
  (read-line ins nil :eof)
  (loop for i upto 50 
     do (parse-hit (read-line ins nil :eof)))))

My first attempt is a very naive approach and I recognize that my code now could use some improvement, so I am asking for some direction. If a tutorial is more a more appropriate way to answer this question, please post a link. I enjoy

Upvotes: 2

Answers (3)

user1086419

Reputation: 85

I made some modifications to the code and have cut the time it takes to execute and the number of bytes that are allocated significantly.

New Code: took 65,000 microseconds (0.065000 seconds) to run with 8 available CPU cores. During that period, 62,400 microseconds (0.062400 seconds) were spent in user mode 0 microseconds (0.000000 seconds) were spent in system mode 2,001 microseconds (0.002001 seconds) was spent in GC. 1,029,024 bytes of memory allocated.

(let ((date-time-string (make-array 20 :initial-element nil)))
  (defun read-date-time (hit)
    (read-sequence date-time-string hit :start 0 :end 20)
    (local-time:timestamp-to-universal (local-time:parse-timestring (map 'string #'code-char date-time-string) :date-time-separator #\Space))))

(defun parse-hit (hit)
  (cons (read-date-time hit) (split-sequence:split-sequence #\Space (read-line hit))))

(defun timeparse (lines)
  (time (gzip-stream:with-open-gzip-file (ins "C:\\temp\\test.log.gz")
      (read-line ins nil :eof) 
      (loop for i upto lines
         do (parse-hit ins)))))

The file unzipped is 14MB with around 14.5k lines in it.

Parsing 10k lines of the file results in an allocation of 178MB in 8.5 seconds. I believe this has to do with the gzip-stream library or one of its dependencies.

Without parsing and just performing a read-line on 10k lines from the compressed file takes 8 seconds and allocates 140MB of memory.

Without parsing and just performing a read-line on the same 10k lines from the uncompressed file takes about 1/10th of a second and only allocates 28MB.

There isn't much I can do about the performance of gzip-stream right now, so I will have to live with it until the performance is no longer tolerable.

Thank you everyone for the help and recommendations.

Upvotes: 0

user797257

Reputation:

Here's something I've tried for example, not a complete log parser, of course you may choose to represent the data differently / choose what is important to you, this is just the general idea:

(defparameter *log*
  "2012-11-04 23:00:04 10.1.151.54 GET /_layouts/1033/js/global.js v=5 443 - 10.1.151.61 Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/535.2+(KHTML,+like+Gecko)+Chrome/15.0.864.0+Safari/535.2 200 0 0 31
2012-11-04 23:00:04 10.1.151.54 GET /_layouts/1033/styles/css/topnav.css v=1 443 - 10.1.151.61 Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/535.2+(KHTML,+like+Gecko)+Chrome/15.0.864.0+Safari/535.2 200 0 0 62
2012-11-04 23:00:07 10.1.151.54 GET /pages/index.aspx - 80 - 10.1.151.59 - 200 0 64 374
2012-11-04 23:00:07 10.1.151.52 GET /pages/index.aspx - 80 - 10.1.151.59 - 200 0 64 374")

(defstruct iis-log-entry
  year month day hour minute second ip method path)

(defstruct ipv4 byte-0 byte-1 byte-2 byte-3)

(defun parse-ipv4 (entry &key start end)
  (let* ((b0 (position #\. entry))
         (b1 (position #\. entry :start (1+ b0)))
         (b2 (position #\. entry :start (1+ b1))))
    (make-ipv4
     :byte-0 (parse-integer entry :start start :end b0)
     :byte-1 (parse-integer entry :start (1+ b0) :end b1)
     :byte-2 (parse-integer entry :start (1+ b1) :end b2)
     :byte-3 (parse-integer entry :start (1+ b2) :end end))))

(defun parse-method (entry &key start end)
  ;; this can be extended to make use of END argument and other
  ;; HTTP verbs
  (case (char entry start)
    (#\G 'get)
    (#\P (case (char entry (1+ start))
           (#\O 'post)
           (#\U 'put)))))

(defun parse-path (entry &key start) (subseq entry start))

(defun parse-iis-log-entry (entry)
  (let* ((ip-end (position #\Space entry :start 27))
         (method-end (position #\Space entry :start ip-end)))
    (make-iis-log-entry
     :year (parse-integer entry :start 0 :end 4)
     :month (parse-integer entry :start 5 :end 7)
     :day (parse-integer entry :start 8 :end 10)
     :hour (parse-integer entry :start 11 :end 13)
     :minute (parse-integer entry :start 14 :end 16)
     :second (parse-integer entry :start 17 :end 19)
     :ip (parse-ipv4 entry :start 20 :end ip-end)
     :method (parse-method entry :start (1+ ip-end) :end method-end)
     :path (parse-path entry :start method-end))))

(defun test ()
  (time
   (loop :repeat 50000 :do
      (with-input-from-string (stream *log*)
        (loop
           :for line := (read-line stream nil nil)
           :while line
           :for record := (parse-iis-log-entry line))))))

(test)
;; Evaluation took:
;;   0.790 seconds of real time
;;   0.794000 seconds of total run time (0.792000 user, 0.002000 system)
;;   [ Run times consist of 0.041 seconds GC time, and 0.753 seconds non-GC time. ]
;;   100.51% CPU
;;   1,733,036,566 processor cycles
;;   484,002,144 bytes consed

Also, of course, you could use some macros to reduce the repetitive bits of the parse-* functions.

Upvotes: 0

Will Hartung

Reputation: 118641

The problem is the Chronicity package, and it's use of the Local Time package internally.

This:

   (encode-universal-time (chronicity:sec-of date-time)
           (chronicity:minute-of date-time)
           (chronicity:hour-of date-time)
           (chronicity:day-of date-time)
           (chronicity:month-of date-time)
           (chronicity:year-of date-time))))

Is crushing you.

chronicity:month-of calls local-time:timestamp-month. If you look at the code of that:

 (nth-value 1
         (%timestamp-decode-date
          (nth-value 1 (%adjust-to-timezone timestamp timezone))))

So, here's it's decoding the basic date (which appears to be an integer), twice, (once for the time zone and again for the month.

So you're decoding the same date, doing the same work, 6 times for each date. And those routines are consing up a storm.

You're also calling subseq twice.

So, seems to me you need to concentrate on the date parsing logic in this case, use something less generic. You don't have to validate the date (the logs are presumed accurate), and you don't need to convert to days/seconds/milliseconds since epoch, you just need the individual MDY, HMS data. You're going through all of that work with the current packages, and it's redundant once you create the universal-time.

You also, likely, don't care about the timezone.

Anyway, that's a start on what the issue is. It's not an I/O problem yet.

Upvotes: 3

Common Lisp: Optimizing file parsing for minimum reads and memory allocations

Answers (3)

Related Questions