Reputation: 439
I am using the bash shell and working with the human reference genome GRCh38. I have a directory of files, one file for each chromosome, and I need to search the list of file names. This seems trivial, but the file names inconveniently contain special characters. Example:
ls -1 ../GRCh38_chromosomes/
outputs the contents of the directory:
…
HLA-DRB1*13:01:01?HLA00797_13935_bp.fa
HLA-DRB1*13:02:01?HLA00798_13941_bp.fa
HLA-DRB1*14:05:01?HLA00837_13933_bp.fa
HLA-DRB1*14:54:01?HLA02371_13936_bp.fa
HLA-DRB1*15:01:01:01?HLA00865_11080_bp.fa
HLA-DRB1*15:01:01:02?HLA03453_11571_bp.fa
…
I'm having difficulty searching for a particular filename (from within a script) because the “?” character in particular seems to get replaced by “\t”. Example:
ls -1 ../GRCh38_chromosomes/ | perl -ne ' print $_; '
I expect the same output but instead get:
…
HLA-DRB1*13:01:01 HLA00797_13935_bp.fa
HLA-DRB1*13:02:01 HLA00798_13941_bp.fa
HLA-DRB1*14:05:01 HLA00837_13933_bp.fa
HLA-DRB1*14:54:01 HLA02371_13936_bp.fa
HLA-DRB1*15:01:01:01 HLA00865_11080_bp.fa
HLA-DRB1*15:01:01:02 HLA03453_11571_bp.fa
…
This causes me a headache when I try a search such as
ls -1 ../GRCh38_chromosomes/ | perl -ne ' if ( $_ =~ /^\QHLA-DRB1*15:01:01:02?\E/ ) { print $_; } '
which should output:
HLA-DRB1*15:01:01:02?HLA03453_11571_bp.fa
but instead finds nothing. I've also tried awk, with the same problem, and I am wondering why special characters were put in the chromosome names for GRCh38. Any ideas on how to deal with these problem characters?
Upvotes: 0
Views: 2398
Reputation: 189327
Your diagnostics are off. The problem is that ls replaces the actual tab character with a question mark, but only when its standard output is a terminal.
This is one of the many reasons you should not use ls in scripts at all.
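To check this yourself, here is a minimal sketch that recreates one such name in a scratch directory. The directory from mktemp and the variable names (demo_dir, tab, piped, masked) are only illustrative, and it assumes an ls that supports the POSIX -q option:

```bash
# Hypothetical scratch directory so that the real data is left alone.
demo_dir=$(mktemp -d)
tab=$(printf '\t')

# Create a file whose name contains a real tab, mimicking the listing above.
touch "$demo_dir/HLA-DRB1*15:01:01:02${tab}HLA03453_11571_bp.fa"

# Through a pipe, ls emits the name byte for byte, tab included;
# od -c makes the tab visible as \t.
piped=$(ls -1 "$demo_dir")
printf '%s\n' "$piped" | od -c

# -q asks ls to print non-printable characters as '?', which is
# the display you saw in the terminal.
masked=$(ls -1q "$demo_dir")
printf '%s\n' "$masked"

rm -r "$demo_dir"
```

So the question mark was never part of the file name; only the tab is.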
You seem to be looking simply for
printf '%s\n' ../GRCh38_chromosomes/"HLA-DRB1*15:01:01:02"*
where printf '%s\n' could be replaced by simply echo, but I guess down the line you will actually want to use this wildcard expression in a for loop or as the file name argument to a completely different command.
The quotes cause the first asterisk to be interpreted literally; the second asterisk, outside the quotes, is a wildcard which matches any string. (The regex asterisk, aka Kleene star, has different semantics still, and does not match itself; instead, it specifies zero or more repetitions of the previous character or grouped expression.)
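As a rough sketch of the for loop variant, the following sets up two of the names from the question in a scratch directory and matches one of them, this time spelling out the tab so that the prefix cannot match a longer allele name. The scratch directory and the variable names (demo_dir, tab, match_count, last_match) are made up for the illustration:

```bash
# Hypothetical scratch directory with two of the names from the question.
demo_dir=$(mktemp -d)
tab=$(printf '\t')
touch "$demo_dir/HLA-DRB1*15:01:01:01${tab}HLA00865_11080_bp.fa" \
      "$demo_dir/HLA-DRB1*15:01:01:02${tab}HLA03453_11571_bp.fa"

match_count=0
last_match=
# Quoted part: literal asterisk and literal tab. Trailing unquoted *: wildcard.
for f in "$demo_dir"/"HLA-DRB1*15:01:01:02${tab}"*; do
    # Without a match the pattern is passed through unchanged, so skip it.
    [ -e "$f" ] || continue
    match_count=$((match_count + 1))
    last_match=${f##*/}   # strip the directory part
    printf 'found: %s\n' "$last_match"
done

rm -r "$demo_dir"
```

The same idea carries over to the Perl attempt in the question: the character to match after the allele name is \t, not a question mark.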
Upvotes: 2