eve_gr

Reputation: 23

Using sed to extract the middle of a line/filename

I have multiple files named:

Genus_species_strain.fasta

I want to use sed to print out:

Genus

species

strain

I want to use the "printed" words in a command like this (prokka is a tool for genome annotation):

prokka $file --outdir `echo $file | sed s/\.fasta//` --genus `echo $file | sed s/_.*\.fasta//` --species `echo $file | sed <something here>` --strain `echo $file | sed <something here>`

I would appreciate the help. I am very new to all of this, and as you see above, I only know how to print out Genus.

Below I have some additional questions (no need to answer these if it only complicates things further). This is one of my attempts to print species, and the questions are the following:

sed s/.*_//1 | sed s/_.*\.fasta//
  1. I know the second command isn't correct. I assume it needs to start from the second _, but I don't know how to do that, since the continuation (that is .fasta) is unique.

  2. When used alone, sed s/.*_//1 returns strain.fasta. How to make it not skip the first _?

  3. Combining commands (either as you see above, or with ;) doesn't seem to work for me.

Upvotes: 2

Views: 184

Answers (3)

Haru Suzuki

Reputation: 142

One-liners that avoid setting multiple variables.

Using sed capture groups (one-liner):

file='Genus_species_strain.fasta'
$(echo "$file" | sed "s/^\([^_]*\)_\([^_]*\)_\([^_]*\)\..*/prokka $file --outdir \1_\2_\3 --genus \1 --species \2 --strain \3/")

Using Bash string manipulation: One liner

file='Genus_species_strain.fasta'
$(echo prokka "$file" --outdir "${file%.*}" --genus "${file%%_*}" --species "$(s=${file#*_}; echo "${s%%_*}")" --strain "$(s=${file##*_}; echo "${s%.*}")")

Awk one liner

file='Genus_species_strain.fasta'
$(echo "$file" | awk -F'[_.]' -v var="$file" '{print "prokka " var " --outdir " $1 "_" $2 "_" $3 " --genus " $1 " --species " $2 " --strain " $3}')

You can now use any of the commands above inside a loop, or with xargs, with the file variable set to each file name in turn. Each one builds a prokka command and, through the enclosing $( ), executes it directly.

I hope it works for you. Accept the answer if you find it more efficient.
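As a rough illustration of the loop mentioned above, here is a minimal sketch that only prints each command instead of running it. The helper name build_prokka_cmd is hypothetical, and the sketch assumes every file is named exactly Genus_species_strain.fasta and sits in the current directory.

```shell
# Hypothetical helper: print the prokka command for one file (dry run).
build_prokka_cmd() {
    f=$1
    base=${f%.fasta}        # Genus_species_strain
    genus=${base%%_*}       # everything before the first underscore
    rest=${base#*_}         # species_strain
    species=${rest%%_*}     # everything before the next underscore
    strain=${rest#*_}       # whatever remains after it
    printf 'prokka %s --outdir %s --genus %s --species %s --strain %s\n' \
        "$f" "$base" "$genus" "$species" "$strain"
}

for file in *.fasta; do
    [ -e "$file" ] || continue   # skip the literal pattern when nothing matches
    build_prokka_cmd "$file"
done
```

Once the printed commands look right, the printf line could be swapped for the real prokka call.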

Upvotes: 1

sseLtaH

Reputation: 11227

Using sed

$ file=path_to_file
$ sed "s/\(\([^_]*\)_\([^_]*\)_\([^.]*\)\).*/prokka $file --outdir \1 --genus \2 --species \3 --strain \4/e" <(echo *.fasta)

Command executed (the e flag, a GNU sed extension, runs the result of the substitution as a shell command):

prokka path_to_file --outdir Genus_species_strain --genus Genus --species species --strain strain
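
If the individual pieces are wanted separately, as in the question's backtick command, one sed call per field may be easier to follow. This is a minimal sketch assuming the name always has exactly two underscores and ends in .fasta. The key point is that .*_ is greedy and runs up to the last underscore, whereas [^_]*_ stops at the first one.

```shell
file='Genus_species_strain.fasta'

# Genus: delete everything from the first underscore onward.
genus=$(printf '%s\n' "$file" | sed 's/_.*//')
# Species: keep only the text between the first and second underscore.
species=$(printf '%s\n' "$file" | sed 's/^[^_]*_\([^_]*\)_.*/\1/')
# Strain: keep only the text between the second underscore and .fasta.
strain=$(printf '%s\n' "$file" | sed 's/^[^_]*_[^_]*_\(.*\)\.fasta$/\1/')
# Output directory: strip the .fasta suffix.
outdir=$(printf '%s\n' "$file" | sed 's/\.fasta$//')

echo "$genus $species $strain $outdir"
```

Quoting the sed expressions keeps the shell from interpreting the backslashes and asterisks before sed sees them.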

Upvotes: 0

Wiktor Stribiżew

Reputation: 626825

You can use string splitting with string manipulation:

file='Genus_species_strain.fasta'
IFS='_.' read -r genus species strain _ <<< "$file"
outdir="${file%.*}"

Then you can use the variables in the command:

prokka "$file" --outdir "$outdir" --genus "$genus" --species "$species" --strain "$strain"

See this online demo:

#!/bin/bash
file='Genus_species_strain.fasta'
IFS='_.' read -r genus species strain _ <<< "$file"
echo "${file%.*}" # outdir
echo "$genus"
echo "$species"
echo "$strain"

Output:

Genus_species_strain
Genus
species
strain
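
To apply the same splitting to many files, something like the following sketch could work. It requires Bash (for the here-string), the function name annotate is hypothetical, and it echoes the command as a dry run rather than calling prokka. Note that IFS is a set of delimiter characters, not a regular expression.

```shell
# Requires Bash (here-string). Hypothetical helper: split one file name
# and print the prokka command for it.
annotate() {
    local file=$1 genus species strain ext
    # Split on "_" and "."; ext receives whatever is left after the third field.
    IFS='_.' read -r genus species strain ext <<< "$file"
    if [ -z "$strain" ] || [ "$ext" != "fasta" ]; then
        echo "skipping $file: expected Genus_species_strain.fasta" >&2
        return 1
    fi
    echo prokka "$file" --outdir "${file%.*}" --genus "$genus" \
        --species "$species" --strain "$strain"
}

for file in *.fasta; do
    [ -e "$file" ] || continue   # nothing matched the glob
    annotate "$file"
done
```

Removing the echo in front of prokka would run the command for real. Names with too few or too many fields are reported on stderr and skipped.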

Upvotes: 2
