ThinkTeamwork

Reputation: 594

Extracting first word of line with regex in Ruby

I have this block of text:

XQuery programming language
C# programming language
declarative programming
XSLT programming language
Haskell programming language vs F* programming language

I want to retrieve the names of the programming languages.

I tried something like

matches = string.scan('/(\w)*\sprogramming language/i')

But that gives me this:

[]
[]
[]
[]

Whereas I want an array like this:

['XQuery','C#','XSLT','Haskell']

What am I doing wrong?

Upvotes: 3

Views: 793

Answers (2)

Cary Swoveland

Reputation: 110675

You need only make a couple of small changes to what you have. I've assumed the text you want always starts at the beginning of a line (because you've excluded 'F*') and is separated from "programming language" by one or more spaces.

text = <<_
XQuery programming language
C# programming language
declarative programming
XSLT programming language
Haskell programming language vs F* programming language
_

text.scan(/(^.+?)\s+programming language/i).flatten
  #=> ["XQuery", "C#", "XSLT", "Haskell"] 

Notes:

  • ^ in the regex is the beginning-of-line anchor. It is zero-width, so (^.+?) and ^(.+?) capture the same text. The third line does not contain "programming language", so it produces no match and contributes nothing to the array returned by scan.
  • The ? after .+ makes it "non-greedy". Without it, the last element of the array returned would be:

    "Haskell programming language vs F*"

  • In problems like this one you often have the choice between using a capture group (as here) or a lookaround (as @AvinashRaj did in his answer).
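
The two approaches in the last note can be compared side by side. This is a minimal sketch: SAMPLE_TEXT, languages_by_capture and languages_by_lookahead are illustrative names, not from the question, and the lookahead variant adds a leading ^ (an assumption on my part) so that 'F*' is excluded as in the desired output.

```ruby
# Sample text copied from the question.
SAMPLE_TEXT = <<~TEXT
  XQuery programming language
  C# programming language
  declarative programming
  XSLT programming language
  Haskell programming language vs F* programming language
TEXT

# Capture-group approach: with a group in the pattern, scan returns one
# array per match, so flatten is needed to get a flat list of names.
def languages_by_capture(text)
  text.scan(/(^.+?)\s+programming language/i).flatten
end

# Lookahead approach: no capture group, so scan returns the matched
# strings directly. The ^ anchor (added here) skips "F*" because it does
# not start a line.
def languages_by_lookahead(text)
  text.scan(/^\S+(?=\s+programming language)/i)
end

p languages_by_capture(SAMPLE_TEXT)
p languages_by_lookahead(SAMPLE_TEXT)
```

Both calls print the same four names for this input. They differ on names containing spaces: .+? accepts them, whereas \S+ does not.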

Upvotes: 1

Avinash Raj

Reputation: 174696

You need to remove the quotes around the regex, so that scan receives a Regexp literal delimited by / rather than a String:

string.scan(/\S+(?=\sprogramming language)/i)

\S+ matches one or more non-space characters. (?=\sprogramming language) is a positive lookahead, which asserts that the match must be followed by a whitespace character and the string "programming language" without including them in the match. The i modifier makes the match case-insensitive.

DEMO

irb(main):001:0> str = "XQuery programming language
irb(main):002:0" C# programming language
irb(main):003:0" declarative programming
irb(main):004:0" XSLT programming language
irb(main):005:0" Haskell programming language vs F* programming language"
=> "XQuery programming language\nC# programming language\ndeclarative programming\nXSLT programming language\nHaskell programming language vs F* programming language"
irb(main):007:0> str.scan(/\S+(?=\sprogramming language)/i)
=> ["XQuery", "C#", "XSLT", "Haskell", "F*"]
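
To see why the quotes matter, here is a small sketch on a single line of the input. The constant names are illustrative only. It relies on String#scan treating a String argument as literal text to search for, and it also shows a second problem with the original pattern once the quotes are gone: a repeated group such as (\w)* keeps only its last repetition.

```ruby
LINE = "XSLT programming language"

# Quoted: the argument is a plain String, so scan looks for those exact
# characters, slashes and backslashes included. They never occur in LINE.
QUOTED_RESULT = LINE.scan('/(\w)*\sprogramming language/i')

# Unquoted, but otherwise unchanged from the question: the pattern now
# matches, yet the group only retains the final \w it consumed.
UNQUOTED_RESULT = LINE.scan(/(\w)*\sprogramming language/i)

# The lookahead version from this answer returns the whole word.
FIXED_RESULT = LINE.scan(/\S+(?=\sprogramming language)/i)

p QUOTED_RESULT    # => []
p UNQUOTED_RESULT  # => [["T"]]
p FIXED_RESULT     # => ["XSLT"]
```

Note also that \w does not match # or *, which is another reason to prefer \S+ for names such as C#.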

Upvotes: 6
