lqdc

Reputation: 531

Reading bytes from many files performance

I have this code to check the file type of each file in a directory. It only needs to read the first 4 bytes of each file and compare them against a pattern.

The code looks a bit convoluted and is really slow, but I can't figure out a faster way to do it in Nim.

What am I doing wrong?

  import os

  var
    buf {.noinit.}: array[4, char]

  let out_pat = ['{', '\\', 'r', 't']
  var
    flag = true
    num_read = 0

  var dirname = "/some/path/*"

  for path in walkFiles(dirname):
      num_read = open(path).readChars(buf, 0, 4)
      for i in 0..num_read-1:
        if buf[i] != out_pat[i]:
          flag = false
      if flag:
        echo path
      flag = true

For comparison, here is Python code that is 2x faster:

import glob

def find_rtf(dir_):
    for path in glob.glob(dir_):
        with open(path, 'rb') as f:
            # Compare the first 4 bytes against the RTF signature
            if f.read(4) == b'{\\rt':
                print(path)

find_rtf("/some/path/*")

And here is a plain command-line version, which is about 10x faster than Python but hits some pipe bug when it encounters 10^6+ files:

time find ./ -type f -print0 | LC_ALL=C xargs -0 -P 6 -n 100 head -c 5 -v| grep "{\\\rt" -B 1

Upvotes: 3

Views: 555

Answers (1)

def-

Reputation: 5403

On my system (Linux) the Nim version is twice as fast as the Python one, though maybe my test files just aren't representative. What operating system are you on?

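You should close the files you open. Your comparison is also wrong if a file is shorter than 4 bytes: the loop only checks the bytes that were actually read, so an empty file, for example, is reported as a match. Here's a minor cleanup: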

import os

const
  out_pat = ['{', '\\', 'r', 't']
  dirname = "/some/path/*"

for path in walkFiles(dirname):
  var buf: array[4, char]
  let file = open(path)
  defer: close(file) # Always close file when it goes out of scope
  discard file.readChars(buf, 0, 4)
  if buf == out_pat:
    echo path
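
To make the short-file problem concrete, here is a small sketch in Python (so it can be set next to the Python version from the question). The function names are made up for illustration. The first function mimics the question's loop, which only compares the bytes that were read; the second compares the whole 4-byte prefix, which is what the cleanup above does with buf == out_pat.

```python
PATTERN = b"{\\rt"  # the same 4 bytes as out_pat: '{', '\', 'r', 't'

def matches_bytes_read_only(data):
    """Mimic the question's loop: compare only the bytes that were read."""
    head = data[:4]
    # With fewer than 4 bytes there are fewer comparisons; with 0 bytes
    # there are none, so the result is True.
    return all(head[i] == PATTERN[i] for i in range(len(head)))

def matches_full(data):
    """Compare the whole 4-byte prefix, so short inputs never match."""
    return data[:4] == PATTERN

# An empty file and a 2-byte file "{\" pass the first check but not the second.
for sample in (b"", b"{\\", b"{\\rtf1 hello}", b"abcd"):
    print(sample, matches_bytes_read_only(sample), matches_full(sample))
```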

Make sure you compile with nim -d:release c foobar.nim.

The command-line version is much faster because it runs 6 processes at the same time. With -P 1 instead of -P 6 it is exactly as fast as the Nim version for me.
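
If the parallelism is what makes the difference, the same idea can be tried from a script. Below is a rough sketch in Python that spreads the checks over 6 workers. Note that it uses threads rather than separate processes like xargs -P 6 does, and the names is_rtf and find_rtf_parallel are made up for this example. I haven't measured it; whether it helps will depend on the disk and on how much is already in the OS cache.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor

RTF_MAGIC = b"{\\rt"  # same 4-byte pattern as in the question

def is_rtf(path):
    """Return True if the file at `path` starts with the RTF magic bytes."""
    try:
        with open(path, "rb") as f:
            return f.read(len(RTF_MAGIC)) == RTF_MAGIC
    except OSError:
        # Unreadable entries (e.g. permission errors) are treated as non-matches.
        return False

def find_rtf_parallel(pattern, workers=6):
    """Check every regular file matching `pattern` using `workers` threads."""
    paths = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so flags line up with paths.
        flags = list(pool.map(is_rtf, paths))
    return [p for p, ok in zip(paths, flags) if ok]
```

It would be called like the version in the question, e.g. find_rtf_parallel("/some/path/*"), and returns the matching paths instead of printing them.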

Upvotes: 4
