Peterdk

Reputation: 16015

Ruby regex ignores first valid(?) match

I am trying to manipulate SRT subtitle files. Here is the start of one file, held in the string @data:

1
00:01:09,611 --> 00:01:12,404
In co-production with

2
00:01:14,783 --> 00:01:17,034
presents

I was matching all the IDs with a regex:

@data.scan(/^\d+\w*$/)

However, this skipped the first 1 and only returned 2..900. I thought the regex was missing some characters, so I inspected @data:

puts @data[0,10].inspect => "1\n00:01:09,611 --> "

I don't understand why the first 1 did not match. Running @data.match() with the same regex also returns the 2 rather than the 1.

I then added a \n before the 1, and it worked. However, I don't understand why ^ needs a preceding \n rather than matching at the real start of the string.

Upvotes: 2

Views: 155

Answers (2)

dbenhur

Reputation: 20408

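As @Dogbert points out in the comments, you have a Unicode BOM (U+FEFF) at the beginning of your string. Because the BOM sits between the start of the string and the 1, the digit is not at the start of a line, so ^ cannot match there. I suspect the BOM is an artifact of whatever program authored the file you're reading. You can work around this in a couple of ways. Remove the character: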

@data = @data[1..-1] if @data[0] == "\ufeff"
# or
@data.sub!(/\A\ufeff/, '')
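
To make the effect concrete, here is a minimal sketch that reproduces the problem on a made-up string and then strips the BOM. The constant `SAMPLE_WITH_BOM` and the helpers `cue_ids` and `strip_bom` are hypothetical names for illustration; the sample assumes the file really does start with U+FEFF.

```ruby
# Hypothetical sample: the first two SRT cues, prefixed with a BOM (U+FEFF).
SAMPLE_WITH_BOM =
  "\uFEFF1\n00:01:09,611 --> 00:01:12,404\nIn co-production with\n\n" \
  "2\n00:01:14,783 --> 00:01:17,034\npresents\n"

# Returns the cue ids found by the regex from the question.
def cue_ids(text)
  text.scan(/^\d+\w*$/)
end

# Returns a copy of the text without a leading BOM (unchanged if there is none).
def strip_bom(text)
  text.sub(/\A\uFEFF/, '')
end

p cue_ids(SAMPLE_WITH_BOM)            # the leading "1" is missed
p cue_ids(strip_bom(SAMPLE_WITH_BOM)) # both ids are found
```

Using the non-mutating `sub` keeps the original string intact, which may be handy if you want to compare before and after.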

Or make your scan regexp treat the BOM like a beginning-of-line anchor by adding a positive look-behind as an alternative to ^:

@data.scan(/(?:^|(?<=\ufeff))\d+\w*$/)
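
As a rough check of the look-behind variant, the sketch below runs it against a made-up BOM-prefixed fragment and against the same fragment without a BOM. The names `SRT_FRAGMENT`, `BOM_TOLERANT_ID` and `cue_ids_bom_tolerant` are hypothetical.

```ruby
# Hypothetical input: an SRT fragment without a BOM.
SRT_FRAGMENT =
  "1\n00:01:09,611 --> 00:01:12,404\nIn co-production with\n\n" \
  "2\n00:01:14,783 --> 00:01:17,034\npresents\n"

# Accept an id at the start of a line OR directly after a BOM.
# The look-behind consumes nothing, so the BOM is not part of the match.
BOM_TOLERANT_ID = /(?:^|(?<=\uFEFF))\d+\w*$/

def cue_ids_bom_tolerant(text)
  text.scan(BOM_TOLERANT_ID)
end

p cue_ids_bom_tolerant("\uFEFF" + SRT_FRAGMENT) # with a BOM
p cue_ids_bom_tolerant(SRT_FRAGMENT)            # without a BOM
```

The BOM is still in the string afterwards, so any later processing that cares about the first line would need to handle it as well.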

Or, as the Tin Man points out, tell Ruby to be BOM-aware when reading the data:

@data = File.read('somedata', mode: 'r:BOM|UTF-8')

Upvotes: 2

the Tin Man

Reputation: 160551

If the problem is a BOM in the document, Ruby supports checking for a BOM along with using multibyte encodings when reading files. From the "IO Encoding" documentation for IO.new:

If “BOM|UTF-8”, “BOM|UTF-16LE” or “BOM|UTF-16BE” are used, ruby checks for a Unicode BOM in the input document to help determine the encoding. For UTF-16 encodings the file open mode must be binary. When present, the BOM is stripped and the external encoding from the BOM is used. When the BOM is missing the given Unicode encoding is used as ext_enc. (The BOM-set encoding option is case insensitive, so “bom|utf-8” is also valid.)
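
A small sketch of what that looks like in practice, assuming a UTF-8 file that starts with the three BOM bytes EF BB BF. The helper `read_bom_sample` and the temporary file are made up for the demonstration; only `File.open` with the `'r:BOM|UTF-8'` mode string is the part taken from the documentation.

```ruby
require 'tempfile'

# Writes a BOM-prefixed sample to a temporary file, then reads it back
# using the BOM-aware mode. Returns the string as read.
def read_bom_sample
  Tempfile.create('sample.srt') do |file|
    file.binmode
    # Three BOM bytes followed by the first cue id and its timing line.
    file.write("\xEF\xBB\xBF1\n00:01:09,611 --> 00:01:12,404\n".b)
    file.flush
    File.open(file.path, 'r:BOM|UTF-8') { |io| io.read }
  end
end

text = read_bom_sample
p text.encoding          # external encoding taken from the BOM
p text.scan(/^\d+\w*$/)  # the first id now sits at the start of the string
```

Because the BOM is stripped at read time, the regex from the question can stay unchanged.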

Upvotes: 3
