Ben

Reputation: 575

Need help extracting text while excluding other characters

Here is the string:

Acanthite (Y: 1855) 02.BA.35 [18] [19] [20]
(IUPAC: Disilver sulfide)
Acetamide (1974-039) 10.AA.20 [21] [22] [23]
(IUPAC: Acetic acid amide)
Achalaite (2013-103) 04.?? [24] [no] [no]
Achavalite (Y: 1939

Here's my regex:

([^B35\[1-9\] 0:Y\(\)\n-.?])+

I've also tried:

^[a-z]+

What I would like as multiline output is the following (no particular programming language is used):

Acanthite
Acetamide
Achalaite
Achavalite

Upvotes: 1

Views: 57

Answers (3)

Wiktor Stribiżew

Reputation: 626926

Since you have a multiline string as input and you need to remove everything but the first words on the lines starting with Latin letters, you can use the following trick:

  • Match and capture the first word on a line (thus, you need the ^ anchor together with the /m multiline modifier, so that ^ matches at the start of each line)
  • Match the rest of the line and all the subsequent lines that do not start with a Latin letter.

The regex is:

(?im)^([a-z]+).*(\r?\n[^a-z].*)*

See the demo

The (?im) prefix is the inline form of the m (multiline) and i (ignore case) flags.

The regex breakdown:

  • ^ - start of line
  • ([a-z]+) - 1 or more Latin letters
  • .* - the rest of line
  • (\r?\n[^a-z].*)* - 0 or more sequences of...
    • \r?\n - a line break (an optional carriage return followed by a line feed)
    • [^a-z] - a character other than a Latin letter
    • .* - the rest of line
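
The breakdown above can be tried out with a minimal Python sketch. The answer does not say what to replace each match with; this sketch assumes the replacement is capture group 1 (the first word), which is what "remove everything but the first words" suggests. The variable names are only illustrative.

```python
import re

# Sample input copied from the question.
text = """Acanthite (Y: 1855) 02.BA.35 [18] [19] [20]
(IUPAC: Disilver sulfide)
Acetamide (1974-039) 10.AA.20 [21] [22] [23]
(IUPAC: Acetic acid amide)
Achalaite (2013-103) 04.?? [24] [no] [no]
Achavalite (Y: 1939"""

# The regex from the answer: first word, rest of the line, then any
# following lines that do not start with a Latin letter.
pattern = re.compile(r"(?im)^([a-z]+).*(\r?\n[^a-z].*)*")

# Assumption: each match is replaced with group 1 (the captured first word).
result = pattern.sub(r"\1", text)
print(result)
```

The line breaks between matches are not consumed by the pattern, so the words stay on separate lines.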

Note that to also match and remove unwanted lines at the very start of the string, you need to add the (?:[^a-z].*\r?\n)* subpattern to the beginning:

(?im)^(?:[^a-z].*\r?\n)*([a-z]+).*(\r?\n[^a-z].*)*
       ^^^^^^^^^^^^^^^^^

See another demo
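
To see what the extra subpattern changes, here is a small Python sketch that compares both regexes on input that begins with lines not starting with a letter. The two leading lines in the sample are made up for illustration, and replacing each match with group 1 is again an assumption rather than something the answer states.

```python
import re

# Hypothetical input: the first two lines do not start with a Latin letter.
text = (
    "(IUPAC: placeholder)\n"
    "[18] [19]\n"
    "Acanthite (Y: 1855) 02.BA.35\n"
    "(IUPAC: Disilver sulfide)\n"
    "Acetamide (1974-039) 10.AA.20"
)

basic = re.compile(r"(?im)^([a-z]+).*(\r?\n[^a-z].*)*")
with_prefix = re.compile(r"(?im)^(?:[^a-z].*\r?\n)*([a-z]+).*(\r?\n[^a-z].*)*")

# Without the prefix, the leading non-letter lines are never matched,
# so they survive the substitution.
basic_result = basic.sub(r"\1", text)

# With the prefix, those lines are consumed as part of the first match.
prefix_result = with_prefix.sub(r"\1", text)

print(basic_result)
print("---")
print(prefix_result)
```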

Upvotes: 1

james jelo4kul

Reputation: 829

Use this pattern:

A\w*e\s

See demo: https://regex101.com/r/hH8xD4/1

Upvotes: 0

Avinash Raj

Reputation: 174716

Just add the case-insensitive modifier, or include A-Z inside the character class.

/^[a-z]+/im

or

(?im)^[a-z]+

or

(?m)^[a-zA-Z]+
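
As a rough illustration in Python (the question names no language, so this choice is an assumption), the three variants can be run through `re.findall`, which returns the matches as a list. The `/.../im` delimiter form is expressed here with the `re.I` and `re.M` flags.

```python
import re

# Sample input copied from the question.
text = """Acanthite (Y: 1855) 02.BA.35 [18] [19] [20]
(IUPAC: Disilver sulfide)
Acetamide (1974-039) 10.AA.20 [21] [22] [23]
(IUPAC: Acetic acid amide)
Achalaite (2013-103) 04.?? [24] [no] [no]
Achavalite (Y: 1939"""

# Equivalent of /^[a-z]+/im: flags passed separately.
with_flags = re.findall(r"^[a-z]+", text, flags=re.I | re.M)

# Inline flags at the start of the pattern.
inline_flags = re.findall(r"(?im)^[a-z]+", text)

# Multiline only, with both letter cases listed in the character class.
explicit_class = re.findall(r"(?m)^[a-zA-Z]+", text)

# Join with newlines to get the multiline output the question asks for.
print("\n".join(inline_flags))
```

Lines that begin with `(` produce no match, because `^` must be followed directly by a letter.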

Upvotes: 0
