Why am I getting incompatible encoding regexp match (UTF-8 regexp with IBM437 string)

DLVARScriptTesting.rb:175:in `sub!': incompatible encoding regexp match (UTF-8 regexp with IBM437 string) (Encoding::CompatibilityError)
    from DLVARScriptTesting.rb:175:in `block in parse_file'
    from DLVARScriptTesting.rb:171:in `each'
    from DLVARScriptTesting.rb:171:in `each_with_index'
    from DLVARScriptTesting.rb:171:in `parse_file'
    from DLVARScriptTesting.rb:371:in `<main>'

That's the full error.

Here are lines 171 & 175

File.readlines(testfile).each_with_index do |line, line_num|

    line.sub!(/^\xEF\xBB\xBF/, '') if line_num == 0

I've tried setting the encoding to UTF-8, but that isn't working. Basically what the code is trying to do is remove xEF xBB xBF before a string if it's there.

Upvotes: 0

Answers (1)

the Tin Man

Reputation: 160551

... Basically what the code is trying to do is remove xEF xBB xBF before a string if it's there.

Why not skip the regex and use a substring match and a substring slice instead? Something like this untested code:

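# Compare raw bytes so the string's encoding tag (IBM437 here) doesn't matter.
line.replace(line.byteslice(3..-1)) if line.byteslice(0, 3).b == "\xEF\xBB\xBF".b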
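
For a concrete check, here is a minimal sketch of a byte-level strip run against a string tagged IBM437, the encoding named in the error message. The helper name strip_bom! and the sample string are made up for illustration. Working on bytes matters because String#[] counts characters rather than bytes, and == returns false when two non-ASCII strings carry different encodings.

```ruby
# Hypothetical helper (not from the original post): removes the leading
# bytes EF BB BF by comparing raw bytes, ignoring the string's encoding tag.
def strip_bom!(line)
  bom = "\xEF\xBB\xBF".b # binary (ASCII-8BIT) copy of the three bytes
  if line.byteslice(0, 3).b == bom
    # Keep everything after the first three bytes; the encoding tag is preserved.
    line.replace(line.byteslice(3..-1))
  end
  line
end

# Made-up sample: the three bytes followed by text, tagged IBM437 as in the error.
sample = "\xEF\xBB\xBFhello\n".dup.force_encoding(Encoding::IBM437)
strip_bom!(sample)
p sample          # the leading three bytes are gone
p sample.encoding # the encoding tag is unchanged
```

Because the comparison never looks at the encoding tag, the same helper should behave the same way whether the line arrives as IBM437, UTF-8, or binary.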

Regular expressions are useful, but they're hardly a replacement for string slicing and dicing. They can also cause major slowdowns if the engine gets confused and has to do a lot of backtracking. So use them when they're appropriate, and use Benchmark or Fruity to test a regular expression against the equivalent operation done with ordinary String methods.
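
As a rough sketch of that kind of comparison, the snippet below times a regex-based strip against a byte-slice strip using Benchmark from the standard library. The sample string and iteration count are arbitrary assumptions, and timings vary by machine and Ruby version, so treat any result as indicative only.

```ruby
require 'benchmark'

sample = "\xEF\xBB\xBFhello\n" # made-up UTF-8 line that starts with EF BB BF
bom    = "\xEF\xBB\xBF".b      # the same three bytes as a binary string
n      = 100_000               # arbitrary iteration count

regex_result = nil
slice_result = nil

Benchmark.bm(6) do |x|
  x.report('regex:') do
    n.times { regex_result = sample.sub(/\A\xEF\xBB\xBF/, '') }
  end
  x.report('slice:') do
    n.times do
      has_bom = sample.byteslice(0, 3).b == bom
      slice_result = has_bom ? sample.byteslice(3..-1) : sample
    end
  end
end
```

Both branches should produce the same stripped string, so the printed timings compare like with like.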

Also, as a scalability thing, don't do:

File.readlines(testfile).each_with_index

readlines reads an entire file into memory and converts it into an array. What's going to happen if your code moves from dev to production and the file being read suddenly goes from being 1K to 500MB? You'll see a major slowdown as Ruby tries to slurp the file and then convert it into an array in memory. In my world 500MB is small and multi-GB files are the norm.

Instead, use foreach, as in File.foreach(testfile).with_index. Or better, don't bother with each_with_index or with_index at all and look at the $. global variable, which holds the number of the last line read from the file. Note that $. starts at 1, so the first line is 1, not 0. foreach reads the file line-by-line, which is just as fast or faster than slurping a file.
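
A minimal sketch of the line-by-line approach, using a temporary file as a stand-in for testfile. The file contents and the lines_seen and indexed variables are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical sample file standing in for `testfile` from the question.
tmp = Tempfile.new('sample')
tmp.write("first\nsecond\nthird\n")
tmp.close

lines_seen = []
File.foreach(tmp.path) do |line|
  # $. holds the number of the last line read, starting at 1.
  lines_seen << [$., line.chomp]
end

# Same loop with an explicit zero-based index instead of $.
indexed = File.foreach(tmp.path).with_index.map { |line, i| [i, line.chomp] }

tmp.unlink
p lines_seen
p indexed
```

Either form holds one line in memory at a time, so the check for the first line becomes $. == 1 or i == 0 depending on which counter is used.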

Upvotes: 1
