Output selected lines from a file as the first column and the file name as the second column

Question

I recently started using PowerShell on Windows 7 to write pipeline-like scripts for the program mothur; before that I used bash scripting on Ubuntu. Everything works well now except for one task:

I would like to format a FASTA file that has the form:

filename.fasta:

>HXXC990
AGTTCAAGGTCTCT
>HXXC991
GGGTTTCAAATCTC
>HXXC992
GGGTCTCTCCTATA

into a tab-delimited file that looks like this:

output.file:

HXXC990    filename
HXXC991    filename
HXXC992    filename

It is important that the first column of the output file contains the sequence names without the ">" sign, and that the second column, separated by a tab, contains the original file name without its extension ("filename"). I already use gci to get the base name of the file and Select-String to output all lines beginning with ">". What remains is formatting the output as two columns and repeating the file name on every line of the second column.

So far I've tried:

Select-String '>' .\filename.fasta | % {$_.Line} | set-content output.txt

to produce a file containing only the lines with a ">" sign; afterwards I removed the signs by replacement. I got the file name with:

$base1 = gci filename.fasta | % {$_.BaseName}

Ansgar Wiechers · Accepted Answer

Try this:

select-string '^>' filename.fasta | % {
  $_ -replace '^.*\\(.*?)\.fasta:\d+:>(.*)$', "`$2`t`$1"
} > output.file

Note that your regular expression should be ^>, not just >. The latter would match > anywhere in a line.
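To illustrate the difference between the two patterns, here is a small sketch using the -match operator. The sample strings are made up for the demonstration (the second one is a contrived non-header line that happens to contain a ">"), they are not taken from a real file.

```powershell
# Hypothetical sample lines, chosen only to show the effect of the anchor
$header   = '>HXXC990'
$sequence = 'AGTT>CAAG'   # contrived: a '>' inside a non-header line

$header   -match '>'      # True
$header   -match '^>'     # True
$sequence -match '>'      # True  (the unanchored pattern also hits this line)
$sequence -match '^>'     # False (the anchored pattern skips it)
```

Select-String uses the same regular expression syntax, so the anchored form should only pick up lines that actually start with ">".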

This can be applied to more than a single file like this:

$recurse = $false

Get-ChildItem "C:\base\folder" -Filter *.fasta -Recurse:$recurse `
  | select-string '^>' `
  | % { $_ -replace '^.*\\(.*?)\.fasta:\d+:>(.*)$', "`$2`t`$1" } > output.file
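
As a possible alternative, the same two columns can be built from the properties of the match objects that Select-String returns (Line and Path) instead of parsing their string form with a regular expression. This is only a sketch: the function name Convert-FastaHeader is made up for this example, and it assumes every header line starts with exactly one ">".

```powershell
# Convert-FastaHeader is a hypothetical helper name, not a built-in cmdlet.
function Convert-FastaHeader {
    param([string[]]$Path)

    # Keep only the lines that start with '>'
    Select-String -Pattern '^>' -Path $Path | ForEach-Object {
        # Drop the leading '>' from the matched line
        $id = $_.Line.Substring(1)
        # File name of the matched file without directory and extension
        $base = [System.IO.Path]::GetFileNameWithoutExtension($_.Path)
        # Emit "name<TAB>basename"
        "$id`t$base"
    }
}
```

It could be used for a single file or for a whole folder like this:

```powershell
Convert-FastaHeader .\filename.fasta | Set-Content output.file

Get-ChildItem "C:\base\folder" -Filter *.fasta |
  ForEach-Object { Convert-FastaHeader $_.FullName } |
  Set-Content output.file
```

Set-Content is used here instead of > because in Windows PowerShell the > operator writes UTF-16 by default, which some downstream tools may not read correctly.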
