loretoparisi

Regex in sed to match a subpath in a path with capturing groups

I have a set of dictionaries, each made of two files named index with the extensions .aff and .dic, like

dictionaries/dictionaries/bg_BG/index.dic
dictionaries/dictionaries/ca_ES/index.dic
dictionaries/dictionaries/cs_CZ/index.dic
dictionaries/dictionaries/da_DK/index.dic
...
dictionaries/dictionaries/bg_BG/index.aff
dictionaries/dictionaries/ca_ES/index.aff
dictionaries/dictionaries/cs_CZ/index.aff
dictionaries/dictionaries/da_DK/index.aff

and I want to copy them to a different folder, naming each file after its subpath (e.g. it_IT), so that I get

myDicts/it_IT.dic
myDicts/it_IT.aff

I came up with this one-liner

for file in dictionaries/dictionaries/**/*.{dic,aff}; do echo ${file}; done

which lists the files in these folders, with the loop variable $file holding a path such as dictionaries/dictionaries/da_DK/index.aff.

Using sed I was able to select (and strip out) the locale pattern with

sed 's:[a-z][a-z][_-][A-Z][A-Z]::';

so that

for file in dictionaries/dictionaries/**/*.{dic,aff}; do echo ${file} | sed 's:[a-z][a-z][_-][A-Z][A-Z]::'; done

now prints

dictionaries/dictionaries//index.dic
dictionaries/dictionaries//index.dic
dictionaries/dictionaries//index.dic
...
dictionaries/dictionaries//index.aff
dictionaries/dictionaries//index.aff
dictionaries/dictionaries//index.aff

As I understand it, to print only a captured group, sed has to match the whole line (both the captured part and the rest) and then reference the group in the replacement.

But I could not figure out how to do this so that $file ends up as

bg_BG.aff
ca_ES.aff
da_DK.aff
...
bg_BG.dic
ca_ES.dic
da_DK.dic

where the extension {aff,dic} should be kept as well. I need to run this as a one-liner for scripting reasons.

[UPDATE] Thanks to the answer below I came up with this solution

for file in dictionaries/dictionaries/**/*.{dic,aff}; do echo $file | sed 's:.*\([a-z][a-z][_-][A-Z][A-Z]\)/index\(.*\):cp & myDicts/\1\2:' | sh; done

which does the job:

$ ls myDicts/
bg_BG.aff cs_CZ.aff de_AT.aff de_DE.aff en_AU.aff en_GB.aff en_ZA.aff eu_ES.aff gl_ES.aff it_IT.aff mn_MN.aff nl_NL.aff pl_PL.aff pt_PT.aff ru_RU.aff sl_SI.aff sv_SE.aff uk_UA.aff
bg_BG.dic cs_CZ.dic de_AT.dic de_DE.dic en_AU.dic en_GB.dic en_ZA.dic eu_ES.dic gl_ES.dic it_IT.dic mn_MN.dic nl_NL.dic pl_PL.dic pt_PT.dic ru_RU.dic sl_SI.dic sv_SE.dic uk_UA.dic
ca_ES.aff da_DK.aff de_CH.aff el_GR.aff en_CA.aff en_US.aff es_ES.aff fr_FR.aff hr_HR.aff lb_LU.aff nb_NO.aff nn_NO.aff pt_BR.aff ro_RO.aff sk_SK.aff sr_RS.aff tr-TR.aff vi_VN.aff
ca_ES.dic da_DK.dic de_CH.dic el_GR.dic en_CA.dic en_US.dic es_ES.dic fr_FR.dic hr_HR.dic lb_LU.dic nb_NO.dic nn_NO.dic pt_BR.dic ro_RO.dic sk_SK.dic sr_RS.dic tr-TR.dic vi_VN.dic

The only pitfall is that it does not capture these path patterns

dictionaries/dictionaries/ca_ES-valencia/
dictionaries/dictionaries/sr_RS-Latn/


Answers (1)

webb

here's a way:

echo dictionaries/dictionaries/da_DK/index.aff |
  sed 's:.*/\([^/]\+\)/index\(\..*\):\1\2:'

output:

da_DK.aff
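
A minimal sketch of the same substitution wrapped in a function, in case it helps to test it on other directory names. The function name dict_name is made up for illustration. The pattern is anchored on the slash before the directory, so the greedy `.*` cannot eat into the directory name, and it uses `[^/][^/]*` instead of `\+`, assuming you may want it to work with a sed that lacks the GNU `\+` extension.

```shell
# Hypothetical helper: print "<dir>.<ext>" for a path ending in <dir>/index.<ext>.
dict_name() {
  # ".*/"        consumes everything up to the slash before the last directory
  # "[^/][^/]*"  captures that directory name (one or more non-slash characters)
  # "\(\..*\)"   captures the extension, including the leading dot
  printf '%s\n' "$1" | sed 's:.*/\([^/][^/]*\)/index\(\..*\):\1\2:'
}

dict_name dictionaries/dictionaries/da_DK/index.aff           # da_DK.aff
dict_name dictionaries/dictionaries/ca_ES-valencia/index.dic  # ca_ES-valencia.dic
```

Because the capture is "anything that is not a slash", names such as ca_ES-valencia or sr_RS-Latn are handled the same way as plain locale codes.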

however, there's a faster way than a for loop:

find dictionaries/dictionaries -name "index.dic" -or -name "index.aff" |
  sed 's:dictionaries/dictionaries/\([^/]\+\)/index\(\..*\):mv & myDicts/\1\2:'

if that produces the commands you want, pipe it to sh:

mkdir myDicts
find dictionaries/dictionaries -name "index.dic" -or -name "index.aff" |
  sed 's:dictionaries/dictionaries/\([^/]\+\)/index\(\..*\):mv & myDicts/\1\2:' |
  sh
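
If you would rather not generate commands and pipe them to sh (which can misbehave on paths containing spaces or shell metacharacters), here is a rough alternative using only shell parameter expansion. It is a sketch, not a drop-in replacement: the function name copy_dicts is hypothetical, it assumes the index files sit exactly one directory below the source folder, and it copies (cp) rather than moves.

```shell
# Hypothetical helper: copy <src>/<name>/index.{dic,aff} to <dest>/<name>.{dic,aff}.
copy_dicts() {
  src=$1
  dest=$2
  mkdir -p "$dest"
  for file in "$src"/*/index.dic "$src"/*/index.aff; do
    [ -e "$file" ] || continue   # skip a glob pattern that matched nothing
    dir=${file%/*}               # drop the trailing "/index.<ext>"
    name=${dir##*/}              # keep only the last directory component
    ext=${file##*.}              # keep only the extension
    cp "$file" "$dest/$name.$ext"
  done
}

# Example call (paths taken from the question):
#   copy_dicts dictionaries/dictionaries myDicts
```

Since the name is taken from whatever the last directory is called, no regex is needed and directories like ca_ES-valencia need no special case.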
