Abecee

Reputation: 2393

RegEx to remove repeated start of line using TextWrangler

Trying to turn

a: 1, 2, 3
a: a, b, v
b: 5, 6, 7
b: 10, 1543, 1345
b: e, fe, sdf
cd: asdf, asdfas dfasdfa,asdfasdfa,afdsfa sdf
e1: asdfas, dafasd, adsf, asdfasd
e1: 1, 3, 2
e1: 9, 8, 7, 6

into

a: 1, 2, 3
   a, b, v
b: 5, 6, 7
   10, 1543, 1345
   e, fe, sdf
cd: asdf, asdfas dfasdfa,asdfasdfa,afdsfa sdf
e1: asdfas, dafasd, adsf, asdfasd
    1, 3, 2
    9, 8, 7, 6

So, the lines are sorted. If consecutive lines start with the same sequence of characters up to and including some separator (here the colon and the blank following it), only the first instance of that prefix should be preserved, while the remainder of every line is kept. There could be up to about a dozen (and a half) lines starting with the identical sequence of characters. The input holds about 4,500 lines…

Tried in TextWrangler.

Whilst the search pattern

^([[:alnum:]]+): (.+)\r((\1:) (.+)\r)*

matches correctly, neither the replacement

\1:\t\2\r\t\3\r

nor

\1:\t\2\r\t\4\r

gets me anywhere close to what I'm looking for.

The search pattern

^(.+): (.+)\r((?<=\1:) (.+)\r)*

is rejected because the lookbehind is not fixed length. I'm not sure it's heading in the right direction anyway, though.

Looking at "How to merge lines that start with the same items in a text file", I wonder whether there is an elegant (say: one search pattern, one replacement, run once) solution at all.

On the other hand, I might just not be able to come up with the right question to search the net for. If you know better, please point me in the right direction.

Keeping the remainder of the rows aligned is, of course, icing on the cake…

Thank you for your time.

Upvotes: 14

Views: 1741

Answers (6)

dlamblin

Reputation: 45351

I tried your sample in Bare Bones Software Inc.'s TextWrangler and came up with a two-pass solution. It is limited to n consecutive lines, and it uses a tab instead of trying to magically match the length of the prefix. Also note that the last line of the file should be an empty line (add a newline after ", 6" in your example).

For our purposes I'm showing you where n=4:

Find: ^([[:alnum:]]+\:)(.+\r)(?:\1(.+\r))?(?:\1(.+\r))?\1(.+)\r
Replace: \1\2\t\3\t\4\t\5\r

You can raise n by one by duplicating a (?:\1(.+\r))? in Find and inserting another \t followed by the next group reference before the final \r in Replace (after \5 comes \t\6, and so on).

After replacing all with this, you can follow it up with:

Find: ^\t+
Replace: \t

This mostly gets you the result you want.
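For readers without TextWrangler, the two passes can be sketched in Python. This is only an approximation under stated assumptions: Python's `re` module stands in for TextWrangler's grep engine, `\n` replaces `\r`, `[A-Za-z0-9]` replaces `[[:alnum:]]`, and the function name `two_pass` is made up for illustration. The Find pattern is generated for a given n rather than typed out.

```python
import re

def two_pass(text, n=4):
    """Collapse runs of 2..n consecutive lines that share a 'prefix:' start.

    Assumption: text ends with a newline, as the answer requires.
    """
    # n - 2 optional middle lines, each capturing its remainder.
    middle = r"(?:\1(.+\n))?" * (n - 2)
    find = r"^([A-Za-z0-9]+:)(.+\n)" + middle + r"\1(.+)\n"
    # Groups: 1 = prefix, 2 = first remainder, 3..n+1 = later remainders.
    later = "".join(r"\t\g<%d>" % i for i in range(3, n + 2))
    replace = r"\g<1>\g<2>" + later + r"\n"
    first = re.sub(find, replace, text, flags=re.MULTILINE)
    # Second pass: unmatched optional groups leave runs of tabs; squeeze them.
    return re.sub(r"^\t+", r"\t", first, flags=re.MULTILINE)

if __name__ == "__main__":
    sample = "a: 1\na: 2\nb: 3\nb: 4\nb: 5\nc: 6\n"
    print(two_pass(sample), end="")
```

As in the original, a run longer than n lines is only partly collapsed, and the blank after the colon stays in front of each remainder.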

Upvotes: 1

Jonny 5

Reputation: 12389

As a workaround for variable-length lookbehind: PCRE allows alternatives of variable length.

PCRE is not fully Perl-compatible when it comes to lookbehind. While Perl requires alternatives inside lookbehind to have the same length, PCRE allows alternatives of variable length.

An idea that requires adding an alternative (a pipe) for each possible prefix length, up to the maximum prefix length:

(?<=(\w\w:)|(\w:)) (.*\n?)\1?\2?

And replace with \t\3. See the test at regex101. Capturing inside the lookbehind is important for not consuming / not skipping a match. The same pattern in an engine with variable-length lookbehind, e.g. .NET: (?<=(\w+:)) (.*\n?)\1?

  • (?<=(\w\w:)|(\w:)) first two capture groups inside lookbehind for capturing prefix: Two or one word characters followed by a colon. \w is a shorthand for [A-Za-z0-9_]

  • (.*\n?) third capture group for stuff between prefixes. Optional newline to get the last match.

  • \1?\2? optionally consumes the same prefix if it starts the following line. Only one of the two can be set: \1 xor \2. Note that the space after a colon is always matched, regardless of the prefix.


Summary: The space after each prefix is converted to a tab. The prefix of the following line is removed only if it matches the current one.
To match and replace multiple spaces and tabs: (?<=(\w\w:)|(\w:))[ \t]+(.*\n?)\1?\2?
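Python's built-in `re` module rejects a lookbehind whose alternatives differ in width, so the pattern above does not compile there as written. A possible adaptation, offered as a sketch rather than a drop-in equivalent, gives each prefix length its own fixed-width lookbehind and keeps the captures inside them. The name `strip_repeats` is hypothetical.

```python
import re

# One fixed-width lookbehind per prefix length (here: two or one word
# characters plus the colon). Group 1 xor group 2 is set; group 3 holds the
# remainder of the line. The trailing \1?\2? swallows a repeated prefix.
PATTERN = re.compile(r"(?:(?<=(\w\w:))|(?<=(\w:))) (.*\n?)\1?\2?")

def strip_repeats(text):
    """Turn the blank after each prefix into a tab and drop repeated prefixes."""
    return PATTERN.sub(r"\t\3", text)

if __name__ == "__main__":
    sample = "a: 1\na: 2\ncd: 3\ncd: 4\nb: 5\n"
    print(strip_repeats(sample), end="")
```

Longer prefixes need one more `(?<=(...))` alternative and one more optional backreference each, mirroring the "one pipe per character" idea.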

Upvotes: 6

CaHa

Reputation: 1166

The problem with the substitution is the uncertain number of matches. When you limit that number e.g. to 12, you could use a regex like this:

^([^:]+): ([^\n]+[\n]*)(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?

with this replacement:

\n\1:\t\2\t\4\t\6\t\8\t\10\t\12\t\14\t\16\t\18\t\20\t\22\t\24

Explanation: it contains basically just two sub-regexes

  • ^([^:]+): ([^\n]+[\n]*) = matches on the first line of a group

  • (\1: ([^\n]+[\n]*))? = optional matches on consecutive lines, belonging to the same group. You have to copy this regex as often as needed to match all lines (i.e. in this case 12x). The ? (= optional) match won't give you an error if there aren't enough matches for all substitutions.

  • the \n at the beginning of the substitution is needed for a formatting issue

  • the result will contain a few empty lines, but I'm sure, you can solve that... ;-)

DEMO 1
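The long pattern does not have to be typed by hand. The following Python sketch builds the two sub-regexes and the matching replacement for any number of optional copies, then tidies up the empty lines mentioned above. It assumes Python's `re` module (where unmatched groups substitute as empty strings); the names `build` and `collapse` are made up for illustration, and `\g<N>` is used instead of `\N` only to keep multi-digit group numbers unambiguous.

```python
import re

HEAD = r"^([^:]+): ([^\n]+[\n]*)"   # first line of a group
TAIL = r"(\1: ([^\n]+[\n]*))?"      # one optional follow-up line

def build(max_extra):
    """Return (pattern, replacement) allowing max_extra follow-up lines."""
    pattern = HEAD + TAIL * max_extra
    # Follow-up line k (1-based) keeps its remainder in group 2k + 2.
    extras = "".join(r"\t\g<%d>" % (2 * k + 2) for k in range(1, max_extra + 1))
    return pattern, r"\n\1:\t\2" + extras

def collapse(text, max_extra=12):
    pattern, replacement = build(max_extra)
    out = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    # Clean-up: drop whitespace-only lines and trailing tabs.
    kept = [line.rstrip() for line in out.splitlines() if line.strip()]
    return "\n".join(kept) + "\n"

if __name__ == "__main__":
    print(collapse("a: 1\na: 2\nb: 3\nc: 4\nc: 5\nc: 6\n"), end="")
```

The replacement must reference exactly as many remainder groups as there are optional copies; otherwise lines that were matched would be dropped silently.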

However, since I'm not a fan of over-sized regexes - and for the case that you have a bigger number of potential matches - I would prefer a solution like this:

DEMO 2

Upvotes: 4

John Smith

Reputation: 1099

The awk one-liner below will do what you want:

awk -F: 'NR==1 {print $0} NR != 1 {if ($1 != prev) print $0; else {for (i=0; i<=length($1); ++i) printf " "; print $2;}} {prev=$1}' < input_file.txt

(put the original text into input_file.txt)

I believe nicer code is possible, but it is time to go to bed :)
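The same line-by-line idea can be sketched in Python for anyone who prefers it to awk. This is not a translation of the one-liner character for character: the function name `collapse_prefixes` is hypothetical, and it splits on the first colon only, so colons in the remainder of a line survive.

```python
def collapse_prefixes(lines, sep=":"):
    """Blank out a prefix that repeats the previous line's prefix.

    The prefix (and separator) is replaced by spaces of the same width,
    which keeps the remainders aligned.
    """
    out = []
    prev = None
    for line in lines:
        prefix, found, rest = line.partition(sep)
        if found and prefix == prev:
            out.append(" " * (len(prefix) + len(sep)) + rest)
        else:
            out.append(line)
        # Lines without a separator reset the run.
        prev = prefix if found else None
    return out

if __name__ == "__main__":
    sample = ["a: 1, 2, 3", "a: a, b, v", "b: 5, 6, 7", "e1: x", "e1: 1, 3, 2"]
    print("\n".join(collapse_prefixes(sample)))
```

Because it only remembers the previous prefix, it handles runs of any length in a single pass over the sorted input.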

Upvotes: 1

Izzy

Reputation: 272

So, since you would like to replace every instance after the first one, I assume you need a regex that matches everything but the first so you can replace those matches. A regular expression, as you know, cannot modify the original string; it only returns matches, which can then be used to select the parts of the string to modify.

The best regex I could come up with is /(\b[a-zA-Z0-9]+: )[^\n]+(?:\n|$)(?!\1)/g.

This will capture every unique instance of xx: and match the last line carrying it. The only issue is that it still matches that last line even if it is the only one with that prefix.
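To see which lines that expression selects, here is a small check in Python. It assumes Python's `re` module behaves like the JavaScript-style /.../g literal for this pattern, with `finditer` standing in for the g flag; the name `last_lines` is made up for illustration.

```python
import re

# Same pattern as above, without the /.../g delimiters.
LAST_OF_RUN = re.compile(r"(\b[a-zA-Z0-9]+: )[^\n]+(?:\n|$)(?!\1)")

def last_lines(text):
    """Return the full text of every match, in order."""
    return [m.group(0) for m in LAST_OF_RUN.finditer(text)]

if __name__ == "__main__":
    for line in last_lines("a: 1\na: 2\nb: 3\n"):
        print(repr(line))
```

On this input the first "a:" line is skipped because the same prefix follows it, while the single "b:" line is still matched, which is the limitation described above.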

My conclusion is that I don't believe you can do this with regex alone. I may be wrong: if someone can find an online regex debugger that supports both lookbehind and backreferencing, let me know and I'll see if I can write an expression that works. I could not find one myself. In my example I use lookahead instead: it checks whether another instance of the same prefix follows and, if so, ignores the current match, so only the last instance is selected.

If you really wanted to automate this, use /(\b[a-zA-Z0-9]+: )/g to match every instance of xx:, store them all in an array and, if there is a duplicate, run the original regex on that specific one, trimming repeatedly until no duplicates remain. You may also be able to use it to store all unique instances and work from those.

Hope this helps or clarifies your problem, apologies if it doesn't.

Upvotes: 0

Tim.Tang

Reputation: 3188

I don't have TextWrangler to test with, but I tested this in another regex tool and it works well. Please try:

(?<=(?:(?:.+\n)|^)(\w+?:).+\n)\1(?=\s)

Upvotes: -1
