BarneyL. BarStin

Reputation: 343

How to use match in Ruby?

I'm trying to get the uppercase words from a text. How can I use .match() for this? Example:

text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"

and I need something like:

r = /[A-Z]/
puts r.match(text)

I have never used match, and I need a method that gets all uppercase words (acronyms).

Upvotes: 2

Views: 332

Answers (4)

the Tin Man

Reputation: 160551

If you only want acronyms, you can use something like:

text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"

text.scan(/\b[A-Z]+\b/)
# => ["PS"]

It's important to match entire words, which is where \b helps, as it marks word boundaries.
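To see what \b contributes, here is a small comparison. It is only a sketch on a shortened version of the sample text, running the same character class with and without the boundaries:

```ruby
text = "Pediatric stroke (PS) is a relatively rare disease"

# Without word boundaries, the capital at the start of "Pediatric" also matches.
without_boundaries = text.scan(/[A-Z]+/)

# With \b on both sides, only runs of capitals that form a whole word match.
with_boundaries = text.scan(/\b[A-Z]+\b/)

p without_boundaries # => ["P", "PS"]
p with_boundaries    # => ["PS"]
```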

The problem is when your text contains single, stand-alone capital letters:

text = "Pediatric stroke (PS) I U.S.A"

text.scan(/\b[A-Z]+\b/)
# => ["PS", "I", "U", "S", "A"]

At that point we need a bit more intelligence and foreknowledge of the text content being searched. The question is, are single-letter acronyms valid? If not, then a minor modification will help:

text.scan(/\b[A-Z]{2,}\b/)
# => ["PS"]

{2,} is explained in the Regexp documentation, so read that for more information.


I only want acronyms of the type "(ACRONYM)", in this case PS

It's not easy to tell what you want from your description. An acronym is defined as:

An acronym is an abbreviation used as a word which is formed from the initial components in a phrase or a word. Usually these components are individual letters (as in NATO or laser) or parts of words or names (as in Benelux).

according to Wikipedia. By that definition, lowercase, all caps and mixed case can be valid.

If you mean you only want all-caps words within parentheses, you can easily modify the regex to honor that, but it will fail on other acronyms you could encounter, either by missing ones you want or by capturing ones you want to ignore.

text = "(PS) (CT/CAT scan)"
text.scan(/\([A-Z]+\)/) # => ["(PS)"]

text.scan(/\([A-Z]+\)/).map{ |s| s[1..-2] } # => ["PS"]

text.scan(/\(([A-Z]+)\)/) # => [["PS"]]
text.scan(/\(([A-Z]+)\)/).flatten # => ["PS"]

are varying ways to grab the text, but this only opens a new can of worms when you look at "List of medical abbreviations" and "Medical Acronyms / Abbreviations".

Typically I'd have a table of the ones I'll accept, use a simple pattern to capture anything that looks like something I'd want, check to see if it's in the table then keep it or reject it. How to do that is for you to figure out as it's a completely different question and doesn't belong in this one.
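As a rough illustration of that table-lookup idea, and not a definitive implementation: the sketch below assumes a small hard-coded set, where KNOWN_ACRONYMS and extract_acronyms are hypothetical names and the entries are made up for the example. A real table would come from a curated list.

```ruby
require "set"

# Hypothetical table of accepted acronyms (illustrative entries only).
KNOWN_ACRONYMS = Set["PS", "CT", "CAT", "MRI"].freeze

# Capture anything that looks like an acronym, then keep only the known ones.
def extract_acronyms(text, known = KNOWN_ACRONYMS)
  text.scan(/\b[A-Z]{2,}\b/).select { |candidate| known.include?(candidate) }.uniq
end

sample = "Pediatric stroke (PS) was confirmed by (CT/CAT scan) in the USA."
p extract_acronyms(sample) # => ["PS", "CT", "CAT"]
```

"USA" is captured by the loose pattern but rejected because it is not in the table, which is the point of the two-step approach.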

Upvotes: 1

Cary Swoveland

Reputation: 110675

Yes, you can use String#match for this. It may not be the best way, but you didn't ask if it was. You'd have to do something like this:

text.split.map { |s| s.match(/[A-Z]\w*/) }.compact.map { |md| md[0] }
  #=> ["Pediatric", "PS"]

If you knew in advance that text contained two words beginning with a capital letter, you could write:

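# Lazy .*? so the second group starts at the next capitalized word;
# a greedy .* would backtrack from the end and capture only "S".
text.match(/([A-Z]\w*).*?([A-Z]\w*)/)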
[$1,$2]
  #=> ["Pediatric", "PS"]

Note that using a regex is not your only option:

text.delete('.,!?()[]{}').split.select { |str| ('A'..'Z').cover?(str[0]) }
  #=> ["Pediatric", "PS"]

Upvotes: 1

Wiktor Stribiżew

Reputation: 626816

To get all words that start with uppercase, use String#scan with \b\p{Lu}\w*\b:

text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
puts text.scan(/\b\p{Lu}\w*\b/).flatten

See demo

String#match only returns the first match, while String#scan returns all matches.
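A short side-by-side makes the difference concrete. This is a sketch on a shortened version of the sample text, and the variable names are only for illustration:

```ruby
text = "Pediatric stroke (PS) is a relatively rare disease"
pattern = /\b\p{Lu}\w*\b/

first_match = text.match(pattern) # MatchData for the first match only (nil if none)
all_matches = text.scan(pattern)  # Array with every non-overlapping match

p first_match[0] # => "Pediatric"
p all_matches    # => ["Pediatric", "PS"]
```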

The regex \b\p{Lu}\w*\b matches:

  • \b - word boundary
  • \p{Lu} - an uppercase Unicode letter
  • \w* - 0 or more word characters (letters, digits, or underscores)
  • \b - a trailing word boundary

To only match linguistic words (made of letters) you can use

puts text.scan(/\b\p{Lu}\p{M}*+(?>\p{L}\p{M}*+)*\b/).flatten

See another demo

Here, \p{Lu}\p{M}*+ matches any Unicode uppercase letter together with any combining marks that follow it (so decomposed letters are handled too, as \p{M} matches diacritics), and (?>\p{L}\p{M}*+)* matches 0 or more further letters.

To only get words in ALLCAPS, use

puts text.scan(/\b(?>\p{Lu}\p{M}*+)+\b/).flatten

See the 3rd demo

Upvotes: 1

Amadan

Reputation: 198314

Wrong function for the job. Use String#scan.

Upvotes: 1
