Mchief

Reputation: 137

Notepad++ conditional replace

Task: Replace the CRLF with a space before every line whose first alphabetic sequence is not written entirely in capitals.

Text I have:

FOO Bar 123 sometext BAR sometext
Foobar, sometext 123
FOOBAR^&%# sometext sometext 1234 5678
Bar 123 456 FOO 789
barfoobar sometext
BAR; sometext (*&%#) FOOBAR 123

Expected result:

FOO Bar 123 sometext BAR sometext Foobar, sometext 123
FOOBAR^&%# sometext sometext 1234 5678 Bar 123 456 FOO 789 barfoobar sometext
BAR; sometext (*&%#) FOOBAR 123

I forgot to mention (in case it matters) that the source text is in Russian (Cyrillic, Windows-1251); a sample is below.

AБИДЖАН (Abidjan) , город и главный порт государства Кот-д'Ивуар,
Aдминистративный центр деп. Абиджан. Ок. 2 млн. жителей 
Aдм. ц. французской колонии Берег Слоновой Кости (БСК). В 1960-83 столица Государства БСК.

Thanks very much for any help.

Cheers,

Michael

Upvotes: 1

Views: 192

Answers (3)

Toto

Reputation: 91430

  • Ctrl+H
  • Find what: \R(?![A-ZА-Я]+\b)
  • Replace with: A single space
  • CHECK Match case
  • CHECK Wrap around
  • CHECK Regular expression
  • Replace all

Explanation:

\R                  # any kind of linebreak (i.e. \r, \n, \r\n)
(?!                 # negative lookahead: the line break must not be followed by
    [A-ZА-Я]+           # one or more capital Latin or Cyrillic letters
    \b                  # and a word boundary, i.e. a whole all-caps word
)                   # end lookahead
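
For readers who want to try the same idea outside Notepad++, here is a minimal Python sketch of the negative-lookahead approach, run against the sample from the question. It is an illustration, not part of the original answer: the function name `join_continuation_lines` is hypothetical, and because Python's `re` module has no `\R`, the sketch assumes line breaks are `\r\n` or `\n` and uses `\r?\n` instead.

```python
import re

# A line break that is NOT followed by a whole all-caps word.
# \u0410-\u042F is the Cyrillic capital range А-Я (note: Ё lies outside it).
# Assumption: \r?\n stands in for Notepad++'s \R, which Python's re lacks.
BREAK_BEFORE_NON_CAPS = re.compile(r"\r?\n(?![A-Z\u0410-\u042F]+\b)")


def join_continuation_lines(text: str) -> str:
    """Replace each qualifying line break with a single space."""
    return BREAK_BEFORE_NON_CAPS.sub(" ", text)


sample = "\r\n".join([
    "FOO Bar 123 sometext BAR sometext",
    "Foobar, sometext 123",
    "FOOBAR^&%# sometext sometext 1234 5678",
    "Bar 123 456 FOO 789",
    "barfoobar sometext",
    "BAR; sometext (*&%#) FOOBAR 123",
])

print(join_continuation_lines(sample))
```

Python matching is case-sensitive by default, which corresponds to checking "Match case" in the dialog.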


Upvotes: 0

Mchief

Reputation: 137

After a series of experiments I arrived at a 3-step solution.

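  1. Search for (\n[А-Я] ?[A-Я]+) and replace with \n#$1 (https://regex101.com/r/nVHqUt/1). This marks every line break that precedes an all-caps start.
  2. Search for \r\n and replace with a space.
  3. Search for #\n and replace with \r\n. This restores the marked line breaks.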
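
The three steps can be sketched in Python as follows. This is a hedged illustration rather than the exact Notepad++ behaviour: the function name `three_step_join` is hypothetical, both character classes are written as the explicit Cyrillic capital range `\u0410-\u042F` (an assumption about what the pattern above intends), and `#` is assumed never to appear directly before a line feed in the input.

```python
import re

MARKER = "#"  # assumption: the input never contains "#" directly before "\n"

# Assumption: a Cyrillic capital, an optional space, then more Cyrillic capitals.
CAPS_START = re.compile(r"(\n[\u0410-\u042F] ?[\u0410-\u042F]+)")


def three_step_join(text: str) -> str:
    # Step 1: turn "\r\n" before an all-caps start into "\r\n#\n".
    tagged = CAPS_START.sub("\n" + MARKER + r"\1", text)
    # Step 2: collapse every CRLF into a space (the bare "\n" after "#" survives).
    joined = tagged.replace("\r\n", " ")
    # Step 3: turn the surviving marker plus LF back into a CRLF.
    return joined.replace(MARKER + "\n", "\r\n")
```

Note that step 2 also runs on the marked breaks, so each restored line break ends up with a trailing space in front of it.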

Thanks everyone for your help!

Upvotes: 1

speciesUnknown

Reputation: 1753

Use a regex replace with Unicode escape sequences.

Open find and replace

Enable "Match case"

Set search mode to "Regular expression"

Find what: \r\n([\u0600-\u06FF]{0,1}[\u0061-\u007A]{1,})

Replace with: a single space followed by $1 (the leading space is important)

Upvotes: 1
