Stefan Rohlfing

Reputation: 43

Ruby Regex to Match Both Unix And Windows File Paths

The following instance method takes a file path and returns the file's prefix (the part before the separator):

@separator = "@"

def table_name path
  regex = Regexp.new("\/[^\/]+#{@separator}")
  path.match(regex)[0].gsub(/^.|.$/,'').downcase.to_sym
end

table_name "bla/bla/bla/Prefix@invoice.csv"
# => :prefix

So far, this method only works on Unix. To make it work on Windows, I also need to capture the backslash (\). Unfortunately, that's when I got stuck:

@separator = "@"

def table_name path
  regex = Regexp.new("(\/|\\)[^\/\\]+#{@separator}")
  path.match(regex)[0].gsub(/^.|.$/,'').downcase.to_sym
end

table_name("bla/bla/bla/Prefix@invoice.csv")
# RegexpError: premature end of char-class: /(\/|\)[^\/\]+@/

# Target result:
table_name("bla/bla/bla/Prefix@invoice.csv")
# => :prefix
table_name('bla\bla\bla\Prefix@invoice.csv')
# => :prefix

I suspect Ruby's string interpolation and escaping are what confuse me here.

How could I change the Regex to make it work on both Unix and Windows?

Upvotes: 4

Views: 3985

Answers (1)

sarnold

Reputation: 104070

I don't actually know what bla/bla/bla/Prefix@invoice.csv refers to; are bla/bla/bla all directories, and is Prefix@invoice.csv the filename?

With the assumption that I've correctly understood your filenames, I suggest using File.split():

irb> (path, name) = File.split("bla/bla/bla/Prefix@invoice.csv")
=> ["bla/bla/bla", "Prefix@invoice.csv"]
irb> (prefix, postfix) = name.split("@")
=> ["Prefix", "invoice.csv"]

Not only does it use the path separators of whatever platform Ruby is running on, it is more legible too.
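
A minimal sketch of how table_name from the question might be rebuilt on this idea. Two assumptions that are not in the original: backslashes in the input are always Windows separators (on a Unix Ruby, File.basename only splits on /, so the sketch converts them first), and a file name without the separator should yield nil. The separator keyword argument is hypothetical.

```ruby
# Hypothetical rewrite of table_name; the separator keyword is an assumption.
def table_name(path, separator: "@")
  # Assumption: every backslash in the input is a Windows path separator,
  # so convert it to "/" before asking File for the file name.
  name = File.basename(path.gsub("\\", "/"))
  prefix, rest = name.split(separator, 2)
  # No separator in the file name means there is no prefix to return.
  return nil if rest.nil?
  prefix.downcase.to_sym
end

table_name("bla/bla/bla/Prefix@invoice.csv")  # => :prefix
table_name('bla\bla\bla\Prefix@invoice.csv')  # => :prefix
table_name("bla/bla/bla/invoice.csv")         # => nil
```

The Windows path is written with single quotes so that each \ stays a literal backslash; in a double-quoted string, "\b" would become a backspace character.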

Update

You piqued my curiosity:

>> wpath="blah\\blah\\blah\\Prefix@invoice.csv"
=> "blah\\blah\\blah\\Prefix@invoice.csv"
>> upath="bla/bla/bla/Prefix@invoice.csv"
=> "bla/bla/bla/Prefix@invoice.csv"
>> r=Regexp.new(".+[\\\\/]([^@]+)@(.+)")
=> /.+[\\\/]([^@]+)@(.+)/
>> wpath.match(r)
=> #<MatchData "blah\\blah\\blah\\Prefix@invoice.csv" 1:"Prefix" 2:"invoice.csv">
>> upath.match(r)
=> #<MatchData "bla/bla/bla/Prefix@invoice.csv" 1:"Prefix" 2:"invoice.csv">

You're right, the \ must be double-escaped for it to work in a regular expression: once to get past Ruby's string parser, again to get past the regex engine. (Definitely feels awkward.) The regex is:

.+[\\/]([^@]+)@(.+)

The string is:

".+[\\\\/]([^@]+)@(.+)"

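One way to sidestep the string layer of escaping is a regex literal, where the pattern is written once, exactly as the regex engine sees it. The sketch below compares the two forms; the variable names are only illustrative.

```ruby
# Built from a double-quoted string: every regex backslash is written twice,
# so the escaped backslash inside the character class becomes four characters.
from_string = Regexp.new(".+[\\\\/]([^@]+)@(.+)")

# Built from a %r{} literal: only the regex-level escape is needed, and the
# forward slash needs no escaping because the delimiters are braces.
from_literal = %r{.+[\\/]([^@]+)@(.+)}

puts from_string.source                         # prints .+[\\/]([^@]+)@(.+)
puts from_string.source == from_literal.source  # prints true
```

Regexp.new with a string is still the natural choice when part of the pattern comes from a variable, though a literal also supports #{} interpolation.
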
The regex might be too brittle for real use: how would it handle a path without / or \ path separators, or a pathname with no @ or with too many? It looks for one or more characters, a single path separator, one or more non-@ characters, an @, then one or more characters of any kind. Because the first .+ is greedy, it consumes as many characters as possible, which pushes the match as far to the right as it can go:

>> evil_path="/foo/bar@baz/blorp/Prefix@invoice.csv"
=> "/foo/bar@baz/blorp/Prefix@invoice.csv"
>> evil_path.match(r)
=> #<MatchData "/foo/bar@baz/blorp/Prefix@invoice.csv" 1:"Prefix" 2:"invoice.csv">

But depending upon malformed input data, it might do the very wrong thing.
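
As a rough sketch of one way to tighten this up, the pattern below is anchored to the end of the string and forbids path separators in both captures, so only the final path component can match. STRICT and split_name are hypothetical names, and the rule that the prefix ends at the first @ in the file name is an assumption rather than something the question specifies.

```ruby
# Hypothetical stricter pattern: the prefix may contain neither a path
# separator nor "@", and the rest of the name may not contain a separator.
STRICT = %r{(?:\A|[\\/])([^\\/@]+)@([^\\/]+)\z}

# Returns [prefix, rest] for the last path component, or nil when that
# component has no "@" (instead of raising on a failed match).
def split_name(path)
  m = STRICT.match(path)
  m && [m[1], m[2]]
end

split_name("/foo/bar@baz/blorp/Prefix@invoice.csv")  # => ["Prefix", "invoice.csv"]
split_name('blah\blah\Prefix@invoice.csv')           # => ["Prefix", "invoice.csv"]
split_name("Prefix@invoice.csv")                     # => ["Prefix", "invoice.csv"]
split_name("/foo/bar@baz/invoice.csv")               # => nil
```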

Upvotes: 6
