Reputation: 2393
Trying to turn
a: 1, 2, 3
a: a, b, v
b: 5, 6, 7
b: 10, 1543, 1345
b: e, fe, sdf
cd: asdf, asdfas dfasdfa,asdfasdfa,afdsfa sdf
e1: asdfas, dafasd, adsf, asdfasd
e1: 1, 3, 2
e1: 9, 8, 7, 6
into
a: 1, 2, 3
a, b, v
b: 5, 6, 7
10, 1543, 1345
e, fe, sdf
cd: asdf, asdfas dfasdfa,asdfasdfa,afdsfa sdf
e1: asdfas, dafasd, adsf, asdfasd
1, 3, 2
9, 8, 7, 6
So, the lines are sorted. If consecutive lines start with the same sequence of characters up to / including some separator (here the colon (and the blank following it)), only the first instance should be preserved - as should be the remainder of all lines. There could be up to about a dozen (and a half) lines starting with the identical sequence of characters. The input holds about 4,500 lines…
Tried in TextWrangler.
Whilst the search pattern
^([[:alnum:]]+): (.+)\r((\1:) (.+)\r)*
matches correctly, neither the replacement
\1:\t\2\r\t\3\r
nor
\1:\t\2\r\t\4\r
gets me anywhere close to what I'm looking for.
The search pattern
^(.+): (.+)\r((?<=\1:) (.+)\r)*
is rejected because the lookbehind is not fixed length. I'm not sure it is heading in the right direction anyway, though.
Looking at How to merge lines that start with the same items in a text file, I wonder whether there is an elegant solution at all (say: one search pattern, one replacement, run once).
On the other hand, I might just not be able to come up with the right question to search the net for. If you know better, please point me in the right direction.
Keeping the remainder of the rows aligned would, of course, be icing on the cake…
Thank you for your time.
Upvotes: 14
Views: 1741
Reputation: 45351
I tried your sample in Bare Bones Software Inc.'s TextWrangler and came up with a two-pass solution. It is limited to n consecutive lines, and it uses a tab instead of trying to magically match the length of the prefix. Also note that the last line of the file should be an empty line (in your example, add a newline after the final ", 6").
For our purposes I'm showing you where n=4:
Find: ^([[:alnum:]]+\:)(.+\r)(?:\1(.+\r))?(?:\1(.+\r))?\1(.+)\r
Replace: \1\2\t\3\t\4\t\5\r
You can raise n by one by duplicating a (?:\1(.+\r))? in Find and adding \t followed by the next group number before the final \r in Replace, where that number is one more than the last group number already there (so \t\6 for n=5).
After replacing all with this, follow it up with:
Find: ^\t+
Replace: \t
This gets you most of the way to the result you want.
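For anyone who wants to try the two passes outside TextWrangler, here is a minimal Python sketch of the n=4 case. It is only an approximation of the above: Python's re module has no POSIX classes, so [A-Za-z0-9] stands in for [[:alnum:]], \n stands in for \r, and the helper name merge_two_pass is made up for this example. It also relies on Python 3.5 or later, where unmatched groups expand to an empty string in the replacement.

```python
import re

# [A-Za-z0-9] stands in for [[:alnum:]] and \n for TextWrangler's \r.
# Two optional middle groups plus the mandatory last line give n = 4.
FIND = re.compile(
    r"^([A-Za-z0-9]+:)(.+\n)(?:\1(.+\n))?(?:\1(.+\n))?\1(.+)\n",
    re.MULTILINE,
)
REPLACE = r"\1\2\t\3\t\4\t\5\n"


def merge_two_pass(text: str) -> str:
    """Hypothetical helper: run both replace-all passes for n = 4."""
    # Pass 1: unmatched optional groups expand to "" (Python 3.5+),
    # which is what leaves runs of several tabs behind.
    merged = FIND.sub(REPLACE, text)
    # Pass 2: collapse each leading run of tabs into a single tab.
    return re.sub(r"^\t+", "\t", merged, flags=re.MULTILINE)


sample = (
    "a: 1, 2, 3\na: a, b, v\n"
    "b: 5, 6, 7\nb: 10, 1543, 1345\nb: e, fe, sdf\n"
    "cd: x\n"
)
print(merge_two_pass(sample))
```

Note that the space after the colon travels with each remainder, so continuation lines start with a tab followed by a space.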
Upvotes: 1
Reputation: 12389
As a workaround for variable length lookbehind: PCRE allows alternatives of variable length
PCRE is not fully Perl-compatible when it comes to lookbehind. While Perl requires alternatives inside lookbehind to have the same length, PCRE allows alternatives of variable length.
An idea that requires adding one alternative (one more pipe) for each character of the maximum prefix length:
(?<=(\w\w:)|(\w:)) (.*\n?)\1?\2?
Replace with \t\3. See the test at regex101. Capturing inside the lookbehind is important so that the prefix is not consumed and no match is skipped. The same pattern with a variable-length lookbehind, e.g. in .NET: (?<=(\w+:)) (.*\n?)\1?
(?<=(\w\w:)|(\w:)) : the first two capture groups sit inside the lookbehind and capture the prefix, two or one word characters followed by a colon. \w is shorthand for [A-Za-z0-9_].
(.*\n?) : the third capture group takes the text between prefixes. The newline is optional so that the last line also matches.
\1?\2? : optionally matches the same prefix at the start of the following line, so that the replacement drops it. Only one of the two can be set: \1 xor \2.
Also, the space after the colon is always matched, regardless of the prefix.
Summary: the space after each prefix is converted to a tab. The prefix of the following line is removed only if it matches the current one.
To match and replace multiple spaces and tabs: (?<=(\w\w:)|(\w:))[ \t]+(.*\n?)\1?\2?
Upvotes: 6
Reputation: 1166
The problem with the substitution is the uncertain number of matches. When you limit that number e.g. to 12, you could use a regex like this:
^([^:]+): ([^\n]+[\n]*)(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?
with this replacement:
\n\1:\t\2\t\4\t\6\t\8\t\10\t\12\t\14\t\16\t\18\t\20\t\22\t\24
Explanation: it basically contains just two sub-regexes:
^([^:]+): ([^\n]+[\n]*) matches the first line of a group.
(\1: ([^\n]+[\n]*))? optionally matches consecutive lines belonging to the same group. You have to copy this sub-regex as often as needed to match all lines (in this case 12 times). Because each copy is optional (the ?), you won't get an error if there aren't enough matches for all substitutions.
The \n at the beginning of the substitution is needed to work around a formatting issue.
The result will contain a few empty lines, but I'm sure you can solve that... ;-)
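Rather than pasting the optional part by hand, the pattern and its replacement can be generated from the repetition count. The following is a rough Python sketch of that idea, not a tested TextWrangler recipe: the two sub-regexes are taken as they are, while the helper names build_rule and merge_bounded and the final cleanup pass are my own additions. Generating the replacement from the same count ensures every captured remainder gets a slot.

```python
import re

FIRST = r"^([^:]+): ([^\n]+[\n]*)"  # first line of a group
EXTRA = r"(\1: ([^\n]+[\n]*))?"     # one optional follow-up line


def build_rule(max_extra: int = 12):
    """Hypothetical helper: pattern and replacement for max_extra copies."""
    pattern = re.compile(FIRST + EXTRA * max_extra, re.MULTILINE)
    # Remainders live in the even-numbered groups 2, 4, ..., 2*max_extra + 2.
    slots = "".join(rf"\t\g<{n}>" for n in range(2, 2 * max_extra + 3, 2))
    return pattern, r"\n\1:" + slots


def merge_bounded(text: str, max_extra: int = 12) -> str:
    pattern, replacement = build_rule(max_extra)
    raw = pattern.sub(replacement, text)  # unmatched groups become ""
    # Cleanup (an addition of this sketch): drop whitespace-only lines and
    # the trailing tabs left behind by unused optional groups.
    return "\n".join(
        line.rstrip("\t") for line in raw.split("\n") if line.strip()
    )


print(merge_bounded("a: 1\na: 2\nb: 3", max_extra=2))
```

A group with more than max_extra + 1 lines is still split into several matches, so the limit has to be chosen generously.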
However, since I'm not a fan of over-sized regexes - and for the case that you have a bigger number of potential matches - I would prefer a solution like this:
Combine all lines belonging to the same group (as you already mentioned: How to merge lines that start with the same items in a text file). Within these steps, you can replace the group item with something unique (e.g. :@:).
Then replace this unique item with \n\t
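The two steps could look roughly like this in Python. This is a sketch under assumptions: the marker :@: must never occur in the data, the joining regex is my own guess at how the combining step might be done (it is not taken from the linked question), and merge_with_marker is a made-up name.

```python
import re

MARKER = ":@:"  # assumed never to occur in the data
# A line, its line break, and the identical prefix opening the next line.
JOIN_NEXT = re.compile(r"^([^:\n]+): (.*)\n\1: ", re.MULTILINE)


def merge_with_marker(text: str) -> str:
    # Step 1: repeatedly glue a line onto its predecessor when both start
    # with the same prefix, leaving MARKER where the prefix used to be.
    count = 1
    while count:
        text, count = JOIN_NEXT.subn(
            lambda m: f"{m[1]}: {m[2]}{MARKER}", text
        )
    # Step 2: turn every marker into a line break plus a tab.
    return text.replace(MARKER, "\n\t")


print(merge_with_marker("a: 1\na: 2\na: 3\nb: 4"))
```

The loop is needed because a single replace-all pass only joins lines pairwise; it stops as soon as a pass makes no substitution, so the number of lines per group does not have to be known in advance.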
Upvotes: 4
Reputation: 1099
The awk one-liner below will do what you want
awk -F: 'NR==1 {print $0} NR != 1 {if ($1 != prev) print $0; else {for (i=0; i<=length($1); ++i) printf " "; print $2;}} {prev=$1}' < input_file.txt
(put the original text into input_file.txt)
I believe nicer code is possible, but it is time to go to bed :)
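For readers without awk at hand, the same logic might be written in Python roughly as follows. Treat it as a sketch: align_groups is a made-up name, and unlike the one-liner (which prints only $2) it keeps everything after the first colon, so remainders that contain further colons survive.

```python
def align_groups(lines, sep=":"):
    """Hypothetical port of the awk logic: blank out repeated prefixes."""
    out = []
    prev = None
    for line in lines:
        prefix, found, rest = line.partition(sep)
        if found and prefix == prev:
            # Same prefix as the previous line: pad with spaces as wide as
            # the prefix plus separator so the remainders stay aligned.
            out.append(" " * (len(prefix) + len(sep)) + rest)
        else:
            out.append(line)
        # Lines without a separator never count as a repeated prefix.
        prev = prefix if found else None
    return out


sample = ["a: 1, 2, 3", "a: a, b, v", "b: 5, 6, 7", "b: e, fe, sdf", "cd: asdf"]
print("\n".join(align_groups(sample)))
```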
Upvotes: 1
Reputation: 272
Since you would like to replace every instance after the first one, I assume you need a regex that matches everything but the first so that you can replace those. A regular expression, as you know, cannot modify the original string; it only returns a match, which can then be used to specify the parts of the string to modify.
The best regex I could come up with is /(\b[a-zA-Z0-9]+: )[^\n]+(?:\n|$)(?!\1)/g. This captures the xx: prefix of a line and matches only the last line of each run of identical prefixes. The only issue is that it still matches that last line even when it is the only one with its prefix.
My conclusion is that I don't believe you can do this entirely with regex. I may be wrong; if someone can find an online regex debugger that supports both lookbehind and backreferences, let me know and I'll see if I can write an expression that works. I could not find one myself. In my example I use a lookahead instead: it checks whether another instance of the prefix follows and, if so, skips the current line, so only the last instance is selected.
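To see which lines the lookahead version picks, here is a small Python check. It is only a sketch: Python's re module stands in for whatever engine you use, finditer plays the role of the /g flag, and last_of_each_run is a made-up name.

```python
import re

# Same expression as above; finditer() replaces the /g flag.
LAST_OF_RUN = re.compile(r"(\b[a-zA-Z0-9]+: )[^\n]+(?:\n|$)(?!\1)")


def last_of_each_run(text: str):
    """Hypothetical helper: return the lines the expression matches."""
    return [m.group(0).rstrip("\n") for m in LAST_OF_RUN.finditer(text)]


text = "a: 1\na: 2\nb: 3\nc: 4\nc: 5"
print(last_of_each_run(text))
```

In this input "b: 3" is matched although it is the only line with its prefix, which is the limitation mentioned above.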
If you really want to automate this, use /(\b[a-zA-Z0-9]+: )/g to match every instance of xx:, store them all in an array and, if there is a duplicate, run the original regex on that specific one to keep trimming until no duplicates are left. Again, you may be able to use it to store all unique instances and make use of that somehow.
Hope this helps or clarifies your problem, apologies if it doesn't.
Upvotes: 0
Reputation: 3188
I don't have TextWrangler to test with, but I tested this in another regex tool and it works well. Please try:
(?<=(?:(?:.+\n)|^)(\w+?:).+\n)\1(?=\s)
Upvotes: -1