Reputation: 919
Project:
Take Wikipedia's list of Roman consuls and put the data in a CSV so I can graph the rise and fall of the various gentes in terms of consulships held.
Example data source:
509,L. Iunius Brutus,L. Tarquinius Collatinus
suff.,Sp. Lucretius Tricipitinus,P. Valerius Poplicola
suff.,M. Horatius Pulvillus,
508,P. Valerius Poplicola II,T. Lucretius Tricipitinus
507,P. Valerius Poplicola III,M. Horatius Pulvillus II
Vim search:
/\v(\d+|suff\.),((\w+\.=) (\w+)(\s\w+)=(\s\w+)=(\s[iv]+)=(\s\(.{-}\))=,=){,2}
So essentially:
(\d+|suff\.)
(outer group){,2}
(\w+\.=)
(\w+)
(\s\w+)=
(\s\w+)=
(\s[iv]+)=
(\s\(.{-}\))=
(Last comma is optional since it's the end of the row.)
So the back references turn out to be:
\1: year or suffect
\2: the entire second outer group
\3: Praenomen of second outer group (same with all below)
\4: Nomen
\5: Cognomen
\6: Agnomen
\7: Iteration
\8: Explanatory note
The problem is I can't figure out how to capture that first outer group. It's like the \2 and \3-\8 references get overwritten when it sees that second outer group.
Using this replace:
:%s//1:{\1}^I2:{\2}^I3:{\3}^I4:{\4}^I5:{\5}^I6:{\6}^I7:{\7}^I8:{\8}^I9:{\9}
I get this output:
1:{509} 2:{L. Tarquinius Collatinus} 3:{L.} 4:{Tarquinius} 5:{ Collatinus} 6:{} 7:{} 8:{} 9:{}
1:{suff.} 2:{P. Valerius Poplicola} 3:{P.} 4:{Valerius} 5:{ Poplicola} 6:{} 7:{} 8:{} 9:{}
1:{suff.} 2:{M. Horatius Pulvillus,} 3:{M.} 4:{Horatius} 5:{ Pulvillus} 6:{} 7:{} 8:{} 9:{}
1:{508} 2:{T. Lucretius Tricipitinus} 3:{T.} 4:{Lucretius} 5:{ Tricipitinus} 6:{ II} 7:{} 8:{} 9:{}
1:{507} 2:{M. Horatius Pulvillus II} 3:{M.} 4:{Horatius} 5:{ Pulvillus} 6:{ II} 7:{} 8:{} 9:{}
I can't access those groups within the first outer group. I think they're being overwritten: are they being overwritten? If so, is there a way around this?
Edit: Original title: "Vim regex (or any compatible regex): how to reference a group (within a group) if the outer group is iterated?"
Upvotes: 3
Views: 649
Reputation: 392903
I'd break it down into substeps, employing Vim functions instead of doing it all the normal (pun intended) way:
/\v(.{-}),(.{-}),(.*)
See what I did? I made that much simpler and clearer.
Edit: Getting slightly less lazy, let's define a helper function that splits a name into exactly three substrings (padding with empty strings if there are fewer) and tab-separates them:
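" Split a name on whitespace and return its first three parts, tab-separated
" (padded with empty strings when fewer than three are present).
function! Consul(s)
  return join((split(a:s) + ["","",""])[0:2], "\t")
endfunction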
Now reduce the substitution to this (line breaks for SO only; type it as a single command):
%s/\v(.{-}),(.{-}),(.*)/\=join(
[submatch(1), Consul(submatch(2)), Consul(submatch(3))], "\t")/g
Running that beauty on your input yields
509 L. Iunius Brutus L. Tarquinius Collatinus
suff. Sp. Lucretius Tricipitinus P. Valerius Poplicola
suff. M. Horatius Pulvillus
508 P. Valerius Poplicola T. Lucretius Tricipitinus
507 P. Valerius Poplicola M. Horatius Pulvillus
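For anyone who wants to check the logic outside Vim, here is a rough Python sketch of the same split-and-pad idea. The names `consul` and `convert` are just illustrative, and it assumes every row has at least two commas, as in the sample data. Like the Vim helper, it keeps only the first three words of each name, so an iteration such as "II" is dropped.

```python
def consul(name: str) -> str:
    """Split a name on whitespace and return exactly three tab-joined parts."""
    # Pad with empty strings so short or empty names still yield three fields.
    parts = (name.split() + ["", "", ""])[:3]
    return "\t".join(parts)


def convert(row: str) -> str:
    """Turn 'year,consul1,consul2' into seven tab-separated fields."""
    # Split on the first two commas only, mirroring (.{-}),(.{-}),(.*).
    # Assumption: the row has at least two commas; otherwise this raises ValueError.
    year, first, second = row.split(",", 2)
    return "\t".join([year, consul(first), consul(second)])


rows = [
    "509,L. Iunius Brutus,L. Tarquinius Collatinus",
    "suff.,M. Horatius Pulvillus,",
    "508,P. Valerius Poplicola II,T. Lucretius Tricipitinus",
]
converted = [convert(r) for r in rows]
for line in converted:
    print(line)
```

Each converted row has the same seven columns, which is what makes the follow-up decoration step straightforward.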
I'm pretty sure it will be a very easy step to further decorate the now neatly tab-separated columns to your liking. I might add it, but for now, here's the simplest thing I can think of:
:%s/\v(.{-})\t(.{-})\t(.{-})\t(.{-})\t(.{-})\t(.{-})\t(.{-})$/1:{\1}\t2:{\2}\t3:{\3}\t4:{\4}\t5:{\5}\t6:{\6}\t7:{\7}/g
Result:
1:{509} 2:{L.} 3:{Iunius} 4:{Brutus} 5:{L.} 6:{Tarquinius} 7:{Collatinus}
1:{suff.} 2:{Sp.} 3:{Lucretius} 4:{Tricipitinus} 5:{P.} 6:{Valerius} 7:{Poplicola}
1:{suff.} 2:{M.} 3:{Horatius} 4:{Pulvillus} 5:{} 6:{} 7:{}
1:{508} 2:{P.} 3:{Valerius} 4:{Poplicola} 5:{T.} 6:{Lucretius} 7:{Tricipitinus}
1:{507} 2:{P.} 3:{Valerius} 4:{Poplicola} 5:{M.} 6:{Horatius} 7:{Pulvillus}
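The labelling step can also be sketched in Python without a fixed-width regex. This is only an illustration (the function name `label_fields` is made up), assuming the input line is already tab-separated as above; it numbers however many fields it finds rather than requiring exactly seven.

```python
def label_fields(line: str) -> str:
    """Wrap each tab-separated field as n:{value}, numbering from 1."""
    fields = line.split("\t")
    # Doubled braces in the f-string produce literal { and } around the value.
    return "\t".join(f"{i}:{{{value}}}" for i, value in enumerate(fields, start=1))


sample = "suff.\tM.\tHoratius\tPulvillus\t\t\t"
labelled = label_fields(sample)
print(labelled)
```

Empty fields come through as `{}`, the same as in the result above for the row with a single suffect consul.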
Upvotes: 4
Reputation: 20873
Yes, capturing groups inside a repetition are overwritten with the most recently matched value. According to the Repetition and Backreferences section near the bottom of the linked page:
The regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten.
You'll have to explicitly write out a certain number of capturing groups.
I'm not specifically familiar with vim's regex engine, so here's a simple example.
Let's say your text is abc 12 345 6789 xyz.
# with repetition
/^\w+( \d+){1,3} \w+$/
# yields:
# 0: abc 12 345 6789 xyz
# 1: 6789
# -----
# writing out each subpattern
/^\w+( \d+)( \d+)?( \d+)? \w+$/
# yields:
# 0: abc 12 345 6789 xyz
# 1: 12
# 2: 345
# 3: 6789
Note that with a repetition range of {1,3}, I made the second and third ( \d+) optional with ?.
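The two patterns above can be tried in Python's `re` module, which behaves the same way for repeated groups. This is a quick check in a different engine, not a statement about Vim's internals. Note that each capture includes the leading space, since the space is inside the parentheses.

```python
import re

text = "abc 12 345 6789 xyz"

# One capturing group repeated up to three times: only the last iteration survives.
repeated = re.match(r"^\w+( \d+){1,3} \w+$", text)

# The same group written out three times: every iteration gets its own number.
unrolled = re.match(r"^\w+( \d+)( \d+)?( \d+)? \w+$", text)

print(repeated.groups())
print(unrolled.groups())
```

With the repeated form, group 1 holds only the final number. The written-out form exposes all three, at the cost of hard-coding the maximum count.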
Upvotes: 3