user3031033

Reputation: 21

How to get list of certain strings in a list of files using bash?

The title may not be very descriptive, but I couldn't find a more concise way to describe the problem.

I have a directory containing files whose names look like this, for example:

{some text}2019Q2{some text}.pdf

So each filename contains, somewhere in the name, a year followed by a capital Q and then another number. The rest of the name can be anything, but it won't contain anything else matching the year-Q-number format. There will also be no digits directly before or after this pattern.

I can work something out to get this from one filename, but I actually need a 'list' so I can do a for-loop over this in bash.

So, if my directory contains the files:

costumerA_2019Q2_something.pdf
costumerB_2019Q2_something.pdf
costumerA_2019Q3_something.pdf
costumerB_2019Q3_something.pdf
costumerC_2019Q3_something.pdf
costumerA_2020Q1_something.pdf
costumerD2020Q2something.pdf

I want a for loop that goes over 2019Q2, 2019Q3, 2020Q1, and 2020Q2.

EDIT:

This is what I have so far. It extracts the substrings, but the output still contains duplicates. Since I'm already inside the loop, I don't see how I can remove them.

find original/*.pdf -type f -print0 | while IFS= read -r -d '' line; do
   echo $line | grep -oP '[0-9]{4}Q[0-9]'
done

Upvotes: 0

Views: 922

Answers (2)

Konstantin

Reputation: 547

Try this, in bash:

~ > $ ls
costumerA_2019Q2_something.pdf  costumerB_2019Q2_something.pdf
costumerA_2019Q3_something.pdf  other.pdf
costumerA_2020Q1_something.pdf  someother.file.txt

~ > $ for x in *; do [[ ${x} =~ [0-9]Q[1-4] ]] && echo "$x"; done;
costumerA_2019Q2_something.pdf
costumerA_2019Q3_something.pdf
costumerA_2020Q1_something.pdf
costumerB_2019Q2_something.pdf

~ > $ (for x in *; do [[ ${x} =~ ([0-9]{4}Q[1-4]).+pdf ]] && echo ${BASH_REMATCH[1]}; done;) | sort -u
2019Q2
2019Q3
2020Q1
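
For the for-loop the question asks about, the same BASH_REMATCH idea can be kept entirely inside bash. This is only a minimal sketch, assuming bash 4+ (for associative arrays) and that the PDFs sit directly in one directory; the function name list_periods and the "Processing" message are made up for illustration:

```shell
# Hypothetical helper: print each distinct YYYYQn found in the .pdf names of a directory.
list_periods() {
    local dir=$1 f p
    local -A seen=()                  # associative array used as a set (bash 4+)
    for f in "$dir"/*.pdf; do
        [[ -e $f ]] || continue       # skip the literal pattern when nothing matches
        # Match against the basename only, so the directory name cannot interfere.
        if [[ ${f##*/} =~ ([0-9]{4}Q[0-9]) ]]; then
            p=${BASH_REMATCH[1]}
            if [[ -z ${seen[$p]:-} ]]; then
                seen[$p]=1
                printf '%s\n' "$p"    # print each period once, in first-seen order
            fi
        fi
    done
}

# Usage: word splitting is safe here because a period never contains whitespace.
for period in $(list_periods original); do
    echo "Processing $period"
done
```

With the seven files from the question in original, the loop should visit 2019Q2, 2019Q3, 2020Q1 and 2020Q2, each once.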

Upvotes: 0

KamilCuk

Reputation: 142080

# list the names of all files ending in .pdf in the folder original
find original -maxdepth 1 -name '*.pdf' -type f -printf "%f\n" |
# extract the pattern
sed 's/.*\([0-9]\{4\}Q[0-9]\).*/\1/' |
# iterate
while IFS= read -r file; do
    echo "$file"
done

I used -printf "%f\n" to print just the filename instead of the full path. GNU sed has a -z option that you can use with -print0 (or -printf "%f\0").

With the approach you were taking, if your filenames contain no newlines, there is no need to loop over the list in bash (as a rule of thumb, try to avoid while read line; it's very slow):

find original -maxdepth 1 -name '*.pdf' -type f | grep -oP '[0-9]{4}Q[0-9]'

or with a zero-separated stream:

find original -maxdepth 1 -name '*.pdf' -type f -print0 |
grep -zoP '[0-9]{4}Q[0-9]' | tr '\0' '\n'

If you want to remove duplicate elements from the list, pipe it to sort -u.
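
Putting the pieces together for the loop from the question could look roughly like the sketch below. It assumes filenames without newlines, a grep that supports -o, and bash 4+ for mapfile; the wrapper name unique_periods and the "Processing" message are hypothetical. I used -E instead of -P here only because this pattern doesn't need Perl syntax. Note that grep sees the whole path, so a directory name that itself matches the pattern would also be reported.

```shell
# Hypothetical wrapper around the pipeline above: one distinct period per line.
unique_periods() {
    find "$1" -maxdepth 1 -name '*.pdf' -type f |
        grep -oE '[0-9]{4}Q[0-9]' |
        sort -u
}

# Read the list into an array in the current shell. A "while read" at the end of
# a pipeline runs in a subshell, so variables set inside it would be lost.
mapfile -t periods < <(unique_periods original)

for period in "${periods[@]}"; do
    echo "Processing $period"
done
```

Since the list is deduplicated before the loop starts, the loop body runs once per period instead of once per file.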

Upvotes: 1
