Reputation: 15
I recently started using PowerShell on Windows 7 to write pipeline-like scripts for the program mothur. Previously I used bash scripting on Ubuntu for this. Everything works well now except for one task:
I would like to convert a FASTA file of the form:
filename.fasta:
>HXXC990
AGTTCAAGGTCTCT
>HXXC991
GGGTTTCAAATCTC
>HXXC992
GGGTCTCTCCTATA
into a tab-delimited file that looks like this:
output.file:
HXXC990 filename
HXXC991 filename
HXXC992 filename
It is important that the first column of the output file contains the names without the ">" signs, and that the second, tab-separated column contains the original file name without the suffix ("filename"). I already use gci to get the base name of the file and Select-String to output all lines beginning with ">". The remaining problem is formatting the output into two columns and repeating the file name on every line of the second column.
So far I have tried:
Select-String '>' .\filename.fasta | % {$_.Line} | set-content output.txt
to produce a file containing only the lines that contain a ">" sign; afterwards I simply replaced the ">" signs. I get the file name with
$base1 = gci filename.fasta | % {$_.BaseName}
Upvotes: 0
Views: 588
Reputation: 68303
Here's another solution, showing some different options for the operations involved:
gci *.fasta | select-string '^>(.+)' |
% {"{0}`t{1}" -f $_.Matches[0].Groups[1].Value,$_.Filename.Split('.')[0]} |
Set-Content output.file
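To make the individual pieces of the pipeline above easier to follow, here is a minimal, self-contained sketch. The temp folder name `fasta-demo` and the sample records are made up for the demo. It takes capture group 1 explicitly from the first match of each line and strips the extension with `[IO.Path]::GetFileNameWithoutExtension`, which is only one of several ways to get the base name.

```powershell
# Build a small sample file in a temp folder (hypothetical demo data).
$dir = Join-Path ([IO.Path]::GetTempPath()) 'fasta-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$fasta = Join-Path $dir 'filename.fasta'
Set-Content -Path $fasta -Value '>HXXC990', 'AGTTCAAGGTCTCT', '>HXXC991', 'GGGTTTCAAATCTC'

# One output row per header line: "<id><TAB><file base name>".
$rows = Select-String -Path $fasta -Pattern '^>(.+)' | ForEach-Object {
    $id   = $_.Matches[0].Groups[1].Value                          # text after '>'
    $base = [IO.Path]::GetFileNameWithoutExtension($_.Filename)    # 'filename'
    "{0}`t{1}" -f $id, $base
}
$rows
```

Unlike `Split('.')[0]`, this variant keeps everything up to the last dot, which may matter if a file name itself contains dots.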
Upvotes: 0
Reputation: 200323
Try this:
select-string '^>' filename.fasta | % {
$_ -replace '^.*\\(.*?)\.fasta:\d+:>(.*)$', "`$2`t`$1"
} > output.file
Note that your regular expression should be ^>, not just >. The latter would match > anywhere in a line.
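A quick way to see the difference between the two patterns is a small sketch on made-up lines (the second line with a stray `>` is hypothetical, not something from a real FASTA file):

```powershell
# Hypothetical sample: one header line and one line with a '>' in the middle.
$lines = '>HXXC990', 'AGTT>CAAGG'

# Unanchored pattern: matches '>' anywhere, so both lines are selected.
$loose = @($lines | Where-Object { $_ -match '>' })

# Anchored pattern: '>' must be the first character, so only the header is selected.
$strict = @($lines | Where-Object { $_ -match '^>' })

"unanchored: $($loose.Count), anchored: $($strict.Count)"
```

With this sample the unanchored pattern selects two lines and the anchored one selects one.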
This can be applied to more than a single file like this:
$recurse = $false
Get-ChildItem "C:\base\folder" -Filter *.fasta -Recurse:$recurse `
| select-string '^>' `
| % { $_ -replace '^.*\\(.*?)\.fasta:\d+:>(.*)$', "`$2`t`$1" } > output.file
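One thing to be aware of: in Windows PowerShell (5.1 and earlier) the `>` redirection writes UTF-16 text by default. Whether mothur accepts that is not something I have verified, so the sketch below is a hedged alternative that writes the multi-file result with `Set-Content -Encoding ASCII` instead. The temp folder, the file names `a.fasta` and `b.fasta`, and the output name `output.file` are assumptions made for the demo.

```powershell
# Demo input: two small FASTA files in a temp folder (hypothetical data).
$dir = Join-Path ([IO.Path]::GetTempPath()) 'fasta-multi-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
Set-Content -Path (Join-Path $dir 'a.fasta') -Value '>A1', 'AGTT', '>A2', 'GGGT'
Set-Content -Path (Join-Path $dir 'b.fasta') -Value '>B1', 'CCTA'

# The output file does not end in .fasta, so the filter below never reads it back in.
$out = Join-Path $dir 'output.file'

Get-ChildItem -Path $dir -Filter *.fasta |
    Select-String -Pattern '^>(.+)' |
    ForEach-Object {
        # "<id><TAB><base name of the file the line came from>"
        "{0}`t{1}" -f $_.Matches[0].Groups[1].Value,
                      [IO.Path]::GetFileNameWithoutExtension($_.Path)
    } |
    Set-Content -Path $out -Encoding ASCII

Get-Content $out
```

ASCII is enough for the sample identifiers; if sequence names can contain other characters, a different `-Encoding` value would be the safer choice.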
Upvotes: 0