dingo_d

Reputation: 11670

Matching patterns from a file returns multiple same outputs in bash

I'm trying to extract a list of files defined in my .gitattributes file in bash.

The .gitattributes file looks like this

#
# Exclude these files from release archives.
# This will also make them unavailable when using Composer with `--prefer-dist`.
# https://blog.madewithlove.be/post/gitattributes/
#
/.git export-ignore
/.github export-ignore
/bin export-ignore
/wp-content/themes/**/.storybook export-ignore
/wp-content/themes/**/assets export-ignore
/wp-content/themes/**/storybook export-ignore
/wp-content/themes/**/tests export-ignore
/wp-content/themes/**/.editorconfig export-ignore
/wp-content/themes/**/.env.testing export-ignore
/wp-content/themes/**/.eslintignore export-ignore
/wp-content/themes/**/.eslintrc export-ignore
/wp-content/themes/**/.gitignore export-ignore
/wp-content/themes/**/.stylelintrc export-ignore
/wp-content/themes/**/babel.config.js export-ignore
/wp-content/themes/**/composer.json export-ignore
/wp-content/themes/**/composer.lock export-ignore
/wp-content/themes/**/package.json export-ignore
/wp-content/themes/**/package-lock.json export-ignore
/wp-content/themes/**/phpcs.xml.dist export-ignore
/wp-content/themes/**/phpstan.neon export-ignore
/wp-content/themes/**/phpstan.neon.dist export-ignore
/wp-content/themes/**/postcss.config.js export-ignore
/wp-content/themes/**/webpack.config.js export-ignore
/wp-content/themes/**/CODE_OF_CONDUCT.md export-ignore

composer.lock -diff
yarn.lock -diff
package.lock -diff

#
# Auto detect text files and perform LF normalization
# http://davidlaing.com/2012/09/19/customise-your-gitattributes-to-become-a-git-ninja/
#
* text=auto

#
# The above will handle all files NOT found below
#
*.md text
*.php text
*.inc text

My bash script is inside the bin/ folder, and my .gitattributes is at the root of the project.

sh bin/test.sh path

The script looks like this

#!/bin/bash

#$1 - current_path variable (root)
file_list=()

while read -r line; do
  if [[ "$line" =~ (\/wp-content\/themes\/\*\*/) ]]; then
    newline=$(echo "$line" | sed 's/ export-ignore//p' | sed 's/\/wp-content\/themes\/\*\*\///p')
    file_list+=("$newline")
  fi
done <"$1"/.gitattributes

echo "${file_list[@]}"

But this returns every entry four times instead of once. When I run it I get

.storybook
.storybook
.storybook
.storybook assets
assets
assets
assets storybook
storybook
storybook
storybook tests
tests
tests
tests .editorconfig
.editorconfig
.editorconfig
.editorconfig .env.testing
.env.testing
.env.testing
.env.testing .eslintignore
.eslintignore
.eslintignore
.eslintignore .eslintrc
.eslintrc
.eslintrc
.eslintrc .gitignore
.gitignore
.gitignore
.gitignore .stylelintrc
.stylelintrc
.stylelintrc
.stylelintrc babel.config.js
babel.config.js
babel.config.js
babel.config.js composer.json
composer.json
composer.json
composer.json composer.lock
composer.lock
composer.lock
composer.lock package.json
package.json
package.json
package.json package-lock.json
package-lock.json
package-lock.json
package-lock.json phpcs.xml.dist
phpcs.xml.dist
phpcs.xml.dist
phpcs.xml.dist phpstan.neon
phpstan.neon
phpstan.neon
phpstan.neon phpstan.neon.dist
phpstan.neon.dist
phpstan.neon.dist
phpstan.neon.dist postcss.config.js
postcss.config.js
postcss.config.js
postcss.config.js webpack.config.js
webpack.config.js
webpack.config.js
webpack.config.js CODE_OF_CONDUCT.md
CODE_OF_CONDUCT.md
CODE_OF_CONDUCT.md
CODE_OF_CONDUCT.md

Expected output:

.storybook
assets
storybook
tests
.editorconfig
.env.testing
.eslintignore
.eslintrc
.gitignore
.stylelintrc
babel.config.js
composer.json
composer.lock
package.json
package-lock.json
phpcs.xml.dist
phpstan.neon
phpstan.neon.dist
postcss.config.js
webpack.config.js
CODE_OF_CONDUCT.md

What am I doing wrong?

Upvotes: 2

Views: 147

Answers (3)

markp-fuso

Reputation: 34034

As others will likely point out, there are other (simpler, more efficient) ways to do what the OP is looking to do; the objective of this answer is to address the behavior of the OP's current sed code.

By default sed will pass input through to stdout. Consider:

$ line='/wp-content/themes/**/.storybook export-ignore'
$ echo "${line}" | sed 's/ export-ignore//'
/wp-content/themes/**/.storybook

Adding the p flag to the s command tells sed to print the line one extra time whenever a substitution is made, on top of the default output. Consider:

$ line='/wp-content/themes/**/.storybook export-ignore'
$ echo "${line}" | sed 's/ export-ignore//p'
/wp-content/themes/**/.storybook
/wp-content/themes/**/.storybook

As you can see we get 2 sets of output ... one set due to the normal behavior of sed ... one set due to the additional p directive.

If you want to use the p directive and eliminate the 'duplicate' output you can add the -n (aka --quiet/--silent) flag which disables sed's default behavior of passing input through to stdout. Consider:

$ line='/wp-content/themes/**/.storybook export-ignore'
$ echo "${line}" | sed -n 's/ export-ignore//p'
/wp-content/themes/**/.storybook

Because you have 2 sed commands using the p directive, while not using the -n flag, you end up with a total of 4 copies of each matching input (the first sed generating 2 lines of output; the second sed then doubling the output again).
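The doubling can be checked directly by counting output lines for a single sample line. This is only a small sketch: the variable names both_p and no_p are made up for illustration, and the sed commands are the ones from the question.

```bash
#!/bin/bash
# One sample line from the question's .gitattributes.
line='/wp-content/themes/**/.storybook export-ignore'

# Both sed commands use the p flag without -n: 1 line -> 2 lines -> 4 lines.
both_p=$(echo "$line" | sed 's/ export-ignore//p' | sed 's/\/wp-content\/themes\/\*\*\///p' | wc -l)

# Without the p flag each sed prints the line exactly once.
no_p=$(echo "$line" | sed 's/ export-ignore//' | sed 's/\/wp-content\/themes\/\*\*\///' | wc -l)

# $(( )) strips the leading spaces that some wc implementations emit.
echo "with p: $((both_p)) lines, without p: $((no_p)) line"
```

This should report 4 lines for the first pipeline and 1 for the second.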

To remove the 'duplicates' there are a couple options:

  • remove the p directive from both sed commands or ...
  • add the -n flag to both sed commands
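As a rough sketch of the first option, here is the loop from the question with the p flag dropped. The sample variable stands in for the real .gitattributes file and the regex is stored in a variable named re; both are assumptions made so the snippet runs on its own. Printing with printf instead of echo puts each array element on its own line.

```bash
#!/bin/bash
# A few lines copied from the question's .gitattributes (stand-in for the real file).
sample='/bin export-ignore
/wp-content/themes/**/.storybook export-ignore
/wp-content/themes/**/assets export-ignore
composer.lock -diff'

# Keeping the regex in a variable avoids quoting surprises inside [[ ... =~ ... ]].
re='^/wp-content/themes/\*\*/'

file_list=()
while read -r line; do
  if [[ $line =~ $re ]]; then
    # No p flag and no -n: sed prints each processed line exactly once.
    newline=$(echo "$line" | sed -e 's/ export-ignore//' -e 's/\/wp-content\/themes\/\*\*\///')
    file_list+=("$newline")
  fi
done <<< "$sample"

# One array element per line.
printf '%s\n' "${file_list[@]}"
```

In the real script the here-string would be replaced by the original redirect, done <"$1"/.gitattributes.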

Upvotes: 4

anubhava

Reputation: 784938

This can be done using a simple awk:

awk -F/ 'index($0, "/wp-content/themes/") == 1 {sub(/ .*/, "", $NF); print $NF}' .gitattributes

.storybook
assets
storybook
tests
.editorconfig
.env.testing
.eslintignore
.eslintrc
.gitignore
.stylelintrc
babel.config.js
composer.json
composer.lock
package.json
package-lock.json
phpcs.xml.dist
phpstan.neon
phpstan.neon.dist
postcss.config.js
webpack.config.js
CODE_OF_CONDUCT.md

awk Explanation:

  • -F/: Use / as input field separator
  • index($0, "/wp-content/themes/") == 1: Match only lines that start with /wp-content/themes/
  • sub(/ .*/, "", $NF): Remove everything from the first space onward in the last field
  • print $NF: Print last field
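If the result is needed in a bash array, as in the question's file_list, the awk output can be captured with mapfile. This is a sketch under two assumptions: bash 4 or newer (mapfile is not available in older versions), and a temporary sample file standing in for the real .gitattributes.

```bash
#!/bin/bash
# Temporary sample file; in the real script this would be "$1"/.gitattributes.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
/bin export-ignore
/wp-content/themes/**/tests export-ignore
/wp-content/themes/**/phpstan.neon.dist export-ignore
*.md text
EOF

# mapfile -t stores each line of awk's output as one array element.
mapfile -t file_list < <(
  awk -F/ 'index($0, "/wp-content/themes/") == 1 {sub(/ .*/, "", $NF); print $NF}' "$tmp"
)
rm -f "$tmp"

printf '%s\n' "${file_list[@]}"
```

With this sample the array should hold tests and phpstan.neon.dist.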

Upvotes: 3

Peter Forret

Reputation: 29

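The quick fix would be to pipe the output through sort -u :-) Note that this also sorts the entries, so the original order is lost, and it only helps when every entry ends up on its own line.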

The root cause is your use of the p flag on the sed s command: it prints an extra copy of every line where a substitution was made. You can just leave it out (see the GNU sed manual on gnu.org).

If you need the results one filename per line, I would make the script

while read -r line; do
  if [[ "$line" =~ (\/wp-content\/themes\/\*\*/) ]]; then
    echo "$line" | sed 's/ export-ignore//' | sed 's/\/wp-content\/themes\/\*\*\///'
  fi
done <"$1"/.gitattributes

or, even better, with awk

< "$1/.gitattributes" awk '
/\/wp-content\/themes\/\*\*\// {
    gsub(/\/wp-content\/themes\/\*\*\//,"");
    gsub(/ export-ignore/,"");
    print $0;
}'
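Because every matching line has the same fixed prefix and suffix, the stripping could also be sketched with plain bash parameter expansion, with no sed or awk process per line. The names prefix, suffix and entry are hypothetical, and the here-document stands in for the real .gitattributes.

```bash
#!/bin/bash
prefix='/wp-content/themes/**/'
suffix=' export-ignore'

file_list=()
while read -r line; do
  # Quoting "$prefix" makes the ** match literally rather than as a glob.
  if [[ $line == "$prefix"* ]]; then
    entry=${line#"$prefix"}    # drop the leading path
    entry=${entry%"$suffix"}   # drop the trailing attribute
    file_list+=("$entry")
  fi
done <<'EOF'
/.github export-ignore
/wp-content/themes/**/.eslintrc export-ignore
/wp-content/themes/**/webpack.config.js export-ignore
yarn.lock -diff
EOF

printf '%s\n' "${file_list[@]}"
```

This assumes exactly one space before export-ignore, as in the file shown in the question.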

Upvotes: 2
