Ruby regex ignores first valid(?) match

Question

I am trying to manipulate SRT subtitle files. An example string @data of the start of the file:

1
00:01:09,611 --> 00:01:12,404
In co-production with

2
00:01:14,783 --> 00:01:17,034
presents

I was matching all the IDs with a regex:

@data.scan(/^\d+\w*$/)

However, this ignored the first 1, and only output 2..900. I thought I missed some characters in the regex, and analyzed @data:

puts @data[0,10].inspect => "1\n00:01:09,611 --> "

I don't understand why this first 1 did not match. Also running it with @data.match() doesn't yield the 1 but the 2.

I then added a \n before the 1, and it worked. However, I don't understand why ^ needs a \n instead of a real start of the string.

dbenhur · Accepted Answer

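As @Dogbert points out in comments, you have a Unicode BOM (U+FEFF) at the beginning of your string. It is invisible when printed, but it sits between the start of the first line and the 1, so ^ followed by \d cannot match there. I suspect this is an artifact of whatever program is authoring the file you're reading. You can work around this in a couple of ways. Remove the character: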

@data = @data[1..-1] if @data[0] == "\ufeff"
# or
@data.sub!(/\A\ufeff/, '')
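
To see the effect in isolation, here is a minimal sketch. The sample string and the variable names are made up for illustration, and it assumes the data is a UTF-8 string:

```ruby
# Hypothetical sample: a BOM (U+FEFF) followed by the first two SRT cues.
data = "\uFEFF1\n00:01:09,611 --> 00:01:12,404\nIn co-production with\n\n" \
       "2\n00:01:14,783 --> 00:01:17,034\npresents\n"

# The BOM sits between the start of the line and the "1",
# so ^ followed by \d cannot match on the first line.
ids_with_bom = data.scan(/^\d+\w*$/)

# The code points of the first two characters reveal the hidden character.
first_codepoints = data[0, 2].codepoints.map { |cp| format('U+%04X', cp) }

# Strip a leading BOM (non-destructively here) and scan again.
clean = data.sub(/\A\uFEFF/, '')
ids_clean = clean.scan(/^\d+\w*$/)

p ids_with_bom      # => ["2"]
p first_codepoints  # => ["U+FEFF", "U+0031"]
p ids_clean         # => ["1", "2"]
```

Checking codepoints (or bytes) is usually a more reliable way to spot this than inspect, since the BOM may be rendered as nothing at all.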

Or make your scan regexp treat the BOM like a beginning of line anchor with a positive look-behind:

@data.scan(/(?:^|(?<=\ufeff))\d+\w*$/)
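
A quick sketch of how that pattern behaves with and without a BOM. The constant name, the helper name, and the sample strings are hypothetical; only the regex comes from the line above:

```ruby
# The pattern from the answer: an ID starts either at a line start
# or immediately after a BOM.
ID_PATTERN = /(?:^|(?<=\uFEFF))\d+\w*$/

# Hypothetical helper, just to run the same scan on several inputs.
def cue_ids(text)
  text.scan(ID_PATTERN)
end

body = "1\n00:01:09,611 --> 00:01:12,404\nIn co-production with\n\n" \
       "2\n00:01:14,783 --> 00:01:17,034\npresents\n"

with_bom    = "\uFEFF" + body
without_bom = body

p cue_ids(with_bom)     # => ["1", "2"]
p cue_ids(without_bom)  # => ["1", "2"]
```

The look-behind does not consume the BOM, so the matched IDs come back clean, but the BOM itself is still in the string for any later processing.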

Or, as the Tin Man points out, tell Ruby to be BOM-aware when reading the data:

@data = File.read('somedata', mode: 'r:BOM|UTF-8')
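
For comparison, a small self-contained sketch of reading the same file with and without the BOM-aware mode. The temp file and its contents are made up for the example, and it passes the mode as the `mode:` keyword, which is the form I'd expect to work on current Rubies:

```ruby
require 'tempfile'

# Write a small hypothetical SRT sample that starts with a UTF-8 BOM.
file = Tempfile.new(['sample', '.srt'])
file.binmode
file.write("\uFEFF1\n00:01:09,611 --> 00:01:12,404\nIn co-production with\n")
file.close

# Plain UTF-8 read: the BOM is kept as the first character of the string.
raw = File.read(file.path, encoding: 'UTF-8')

# BOM-aware read: a leading BOM is consumed while opening the file.
clean = File.read(file.path, mode: 'r:BOM|UTF-8')

raw_ids   = raw.scan(/^\d+\w*$/)
clean_ids = clean.scan(/^\d+\w*$/)

p raw.start_with?("\uFEFF")    # => true
p clean.start_with?("\uFEFF")  # => false
p raw_ids                      # => []
p clean_ids                    # => ["1"]

file.unlink
```

This fixes the problem at the source, so the original scan regex can stay unchanged.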
