Reputation: 13
We recently exported patient records from our old EMR system. The trouble is that every note for every patient came out as its own PDF file, resulting in 876,000+ PDFs in one directory, all with a long, cumbersome file name format of ID#-record#.YYYY-MM-DD HH.MM.SS.FIRSTNAME LASTNAME.TYPE OF NOTE.pdf
My first goal is to get all the files into patient directories labeled by ID# FIRSTNAME LASTNAME
For example, for the file labeled
345-1.2011-02-3 08.59.53.JOHN DOE.General Miscellaneous Service.pdf
a directory called 345-JOHN DOE would be created, and any files that start with 345 would be put into it.
I know I can use a script like:
for file in ./*_???ILN*; do
    dir=${file%ILN*}
    dir=${dir##*_}
    mkdir -p "./$dir" &&
    mv -iv "$file" "./$dir"
done
In this example, the script takes the value between the _ and ILN and creates a directory named after just that value. But how, if it is possible at all, can I take the ID# value and the FIRSTNAME LASTNAME value to create a directory?
Upvotes: 1
Views: 108
Reputation: 85767
You could use a regex like this:
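for i in *.pdf; do
    # Group 1 captures the ID, group 2 the name (everything up to the next dot).
    if [[ "$i" =~ ^([0-9]+)-[0-9]+\.[0-9]{4}-[0-9]{2}-[0-9]{1,2}\ [0-9]{2}\.[0-9]{2}\.[0-9]{2}\.([^.]+)\. ]]; then
        id="${BASH_REMATCH[1]}"
        name="${BASH_REMATCH[2]}"
        subdir="$id-$name"
        # -- ends option parsing, so unusual file names are not read as options.
        mkdir -p -- "$subdir"
        mv -- "$i" "$subdir"
    else
        # Report files that do not follow the expected naming scheme.
        echo "couldn't parse file name: $i" >&2
    fi
done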
Bash (since version 3) supports the =~ (regex match) operator in [[ ]], which places the substrings captured by ( ) groups in the BASH_REMATCH array. This is very convenient for extracting information from formatted strings.
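To see what ends up in BASH_REMATCH without touching any files, the match can be tried on the sample file name from the question. This is only a small sketch: the variable names file and re are illustrative, and the pattern is the same one used in the loop above, stored in a variable for readability.

```bash
#!/usr/bin/env bash
# Sample file name taken from the question.
file='345-1.2011-02-3 08.59.53.JOHN DOE.General Miscellaneous Service.pdf'

# Same pattern as in the loop, kept in a variable (illustrative name "re").
# Group 1 is the patient ID, group 2 is the name up to the next dot.
re='^([0-9]+)-[0-9]+\.[0-9]{4}-[0-9]{2}-[0-9]{1,2} [0-9]{2}\.[0-9]{2}\.[0-9]{2}\.([^.]+)\.'

if [[ $file =~ $re ]]; then
    id=${BASH_REMATCH[1]}
    name=${BASH_REMATCH[2]}
    # Prints: id=345 name=JOHN DOE dir=345-JOHN DOE
    printf 'id=%s name=%s dir=%s\n' "$id" "$name" "$id-$name"
else
    echo "no match: $file" >&2
fi
```

Running the loop with the mkdir and mv lines replaced by a printf like this one is a cheap way to check the parsing before moving 876,000 files.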
Note that this will effectively group files by their ID/name combination, not just ID. This means if you have files that have the same ID, but a different name, they will be put in different subdirectories.
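If grouping strictly by ID is preferred, one possible approach is to remember the directory chosen the first time each ID is seen and reuse it afterwards. The sketch below is an assumption on my part, not something the question asked for: dir_for_id and pick_dir are hypothetical names, it needs Bash 4 or later for associative arrays, and the lookup table only covers a single run of the script.

```bash
#!/usr/bin/env bash
# Requires Bash 4+ (associative arrays).
declare -A dir_for_id   # hypothetical lookup table: patient ID -> directory name

# pick_dir (hypothetical helper) sets the global variable "subdir" for an
# ID/name pair, reusing the directory picked when the ID was first seen.
# It sets a variable instead of printing so the table is not lost in a subshell.
pick_dir() {
    local id=$1 name=$2
    if [[ -z ${dir_for_id[$id]:-} ]]; then
        dir_for_id[$id]="$id-$name"
    fi
    subdir=${dir_for_id[$id]}
}

pick_dir 345 'JOHN DOE'; echo "$subdir"   # 345-JOHN DOE
pick_dir 345 'JON DOE';  echo "$subdir"   # still 345-JOHN DOE
```

Inside the loop, calling pick_dir "$id" "$name" in place of the subdir="$id-$name" assignment would send every file with the same ID to the same directory, named after whichever spelling of the name came first.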
Upvotes: 1