Anconia

Reputation: 4028

Regex issue with building a file system crawler

I am building a crawler to search my file system for documents containing specific information, but the regex part has me perplexed. I have a test file on my desktop containing 'teststring' and a test credit card number, '4060324066583245'. The code below runs properly and finds the file containing teststring:

require 'find'
count = 0

Find.find('/') do |f|              # '/' for root directory on OS X
  if f.match(/\.doc\Z/)            # check if filename ends in desired format
    contents = File.read(f)
    if /teststring/.match(contents)
      puts f
      count += 1
    end
  end
end

puts "#{count} sensitive files were found"

Running this confirms that the crawler works and finds matches. However, when I run it to look for the test credit card number, it fails to find a match:

require 'find'
count = 0

Find.find('/') do |f|              # '/' for root directory on OS X
  if f.match(/\.doc\Z/)            # check if filename ends in desired format
    contents = File.read(f)
    if /^4[0-9]{12}(?:[0-9]{3})?$/.match(contents)
      puts f
      count += 1
    end
  end
end

puts "#{count} sensitive files were found"

I checked the regex on rubular.com with 4060324066583245 as a piece of test data, which is contained in my test document, and Rubular verifies that the number is a match for the regex. To sum things up:

  1. The crawler works on the first case using teststring - verifying that the crawler is properly scanning my file system and reading contents of the desired file type
  2. Rubular verifies that my regex successfully matches my test credit card number 4060324066583245
  3. The crawler fails to find the test credit card number.

Any suggestions? I'm at a loss as to why Rubular shows the regex as working while the script finds nothing when run on my machine.

Upvotes: 1

Views: 206

Answers (1)

Tim Pietzcker

Reputation: 336088

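^ and $ are anchors. In Ruby they tie the match to the start and end of a line, respectively (\A and \z anchor to the whole string), so your pattern only matches when the number is the only thing on its line.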

Therefore, ^[0-9]{4}$ will match "1234", but not "12345" or " 1234 " etc.
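
To see the anchor behaviour in isolation, here is a minimal sketch. The helper name anchored_match? and the sample strings are made up for illustration and are not from the question.

```ruby
# Four digits that must fill an entire line.
ANCHORED = /^[0-9]{4}$/

# Hypothetical helper: true when the anchored pattern matches the text.
def anchored_match?(text)
  !ANCHORED.match(text).nil?
end

puts anchored_match?("1234")            # digits alone on the line
puts anchored_match?("12345")           # one digit too many
puts anchored_match?(" 1234 ")          # surrounded by spaces
puts anchored_match?("abc\n1234\nxyz")  # alone on a middle line of a longer string
```

Only the first and last calls should report a match, because in both the four digits occupy a whole line.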

You should be using word boundaries instead:

if contents =~ /\b4[0-9]{12}(?:[0-9]{3})?\b/
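
As a rough sketch of the difference, the snippet below runs the anchored pattern from the question and the word-boundary pattern against the same text. The sample strings and the helper name contains_card_number? are assumptions for illustration. The real file contents may differ, for example if the .doc format stores text in a binary encoding.

```ruby
# Pattern from the question: the number must occupy a whole line.
ANCHORED_PATTERN = /^4[0-9]{12}(?:[0-9]{3})?$/

# Pattern from this answer: the number may appear anywhere, as long as it
# is not glued to other word characters (letters, digits, underscore).
BOUNDARY_PATTERN = /\b4[0-9]{12}(?:[0-9]{3})?\b/

# Hypothetical helper: true when the text contains a card-like number.
def contains_card_number?(contents, pattern = BOUNDARY_PATTERN)
  !(contents =~ pattern).nil?
end

# Made-up sample where the number shares a line with other text.
sample = "teststring and card 4060324066583245 on the same line"

puts contains_card_number?(sample, ANCHORED_PATTERN)   # anchored: no match
puts contains_card_number?(sample)                     # word boundaries: match

# A longer digit run that merely contains the number is not reported.
puts contains_card_number?("order id 940603240665832451")
```

Word boundaries still reject the number when it is embedded in a longer run of digits, which keeps some false positives out of the count.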

Upvotes: 2
