Reputation: 4028
I am building a crawler to search my file system for documents containing specific information, but the regex part is leaving me a little perplexed. I have a test file on my desktop containing 'teststring' and a test credit card number, '4060324066583245'. The code below runs properly and finds the file containing teststring:
require 'find'
count = 0
Find.find('/') do |f| # '/' for root directory on OS X
  if f.match(/\.doc\Z/) # check if filename ends in desired format
    contents = File.read(f)
    if /teststring/.match(contents)
      puts f
      count += 1
    end
  end
end
puts "#{count} sensitive files were found"
Running this confirms that the crawler is working and properly finding matches. However, when I try to run it for finding the test credit card number it fails to find a match:
require 'find'
count = 0
Find.find('/') do |f| # '/' for root directory on OS X
  if f.match(/\.doc\Z/) # check if filename ends in desired format
    contents = File.read(f)
    if /^4[0-9]{12}(?:[0-9]{3})?$/.match(contents)
      puts f
      count += 1
    end
  end
end
puts "#{count} sensitive files were found"
I checked the regex on rubular.com with 4060324066583245 as a piece of test data, which is contained in my test document, and Rubular verifies that the number is a match for the regex. To sum things up:
teststring - verifying that the crawler is properly scanning my file system and reading contents of the desired file type
4060324066583245 - matches the regex on Rubular, but the script fails to find it
Any suggestions? I'm at a loss why Rubular shows the regex as working but the script won't work when run on my machine.
Upvotes: 1
Views: 206
Reputation: 336088
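^ and $ are anchors that tie the match to the start and end of a line, respectively (in Ruby they work per line; \A and \z anchor to the whole string). Therefore, ^[0-9]{4}$ will match "1234", but not "12345" or " 1234 " etc. Your regex can only match when the number sits alone on its own line in the file.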
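A small sketch of that anchoring behaviour, using made-up sample strings (the constant and method names here are hypothetical, chosen only for illustration). The last sample shows that Ruby applies ^ and $ per line, so a number on a line of its own still matches:

```ruby
# Anchored pattern from the example above: exactly four digits on a line.
ANCHORED = /^[0-9]{4}$/

# Hypothetical helper: true if +text+ contains a line made of exactly four digits.
def anchored_match?(text)
  ANCHORED.match?(text)
end

# Made-up sample strings, for illustration only.
puts anchored_match?("1234")                  # true
puts anchored_match?("12345")                 # false
puts anchored_match?(" 1234 ")                # false
puts anchored_match?("card: 1234 (test)")     # false
puts anchored_match?("header\n1234\nfooter")  # true
```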
You should be using word boundaries instead:
if contents =~ /\b4[0-9]{12}(?:[0-9]{3})?\b/
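To try the word-boundary version outside the crawler, here is a minimal sketch that runs it against a few strings. The helper name card_numbers_in and the sample contents are assumptions for illustration, not part of the original script:

```ruby
# Pattern from the line above: a 13- or 16-digit number starting with 4,
# delimited by word boundaries instead of line anchors.
CARD_PATTERN = /\b4[0-9]{12}(?:[0-9]{3})?\b/

# Hypothetical helper: returns every substring of +text+ that matches.
def card_numbers_in(text)
  text.scan(CARD_PATTERN)
end

# Made-up sample contents, for illustration only.
puts card_numbers_in("Test card 4060324066583245 on file.").inspect
puts card_numbers_in("teststring only").inspect
# A longer digit run that merely contains the number is not reported,
# because there is no word boundary between adjacent digits.
puts card_numbers_in("order id 140603240665832459").inspect
```

In the crawler, the same constant could replace the regex literal in the inner if, for example `if contents =~ CARD_PATTERN`.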
Upvotes: 2