Reputation: 64004
I have data that looks like this:
1:SRX000566
Submitter: WoldLab
Study: RNASeq expression profiling for ENCODE project(SRP000228)
Sample: Human cell line GM12878(SRS000567)
Instrument: Solexa 1G Genome Analyzer
Total: 4 runs, 62.7M spots, 2.1G bases
Run #1: SRR002055, 11373440 spots, 375323520 bases
Run #2: SRR002063, 22995209 spots, 758841897 bases
Run #3: SRR005091, 13934766 spots, 459847278 bases
Run #4: SRR005096, 14370900 spots, 474239700 bases
2:SRX000565
Submitter: WoldLab
Study: RNASeq expression profiling for ENCODE project(SRP000228)
Sample: Human cell line GM12878(SRS000567)
Instrument: Solexa 1G Genome Analyzer
Total: 3 runs, 51.2M spots, 1.7G bases
Run #1: SRR002052, 12607931 spots, 416061723 bases
Run #2: SRR002054, 12880281 spots, 425049273 bases
Run #3: SRR002060, 25740337 spots, 849431121 bases
3:SRX012407
Submitter: GEO
Study: GSE17153: Illumina sequencing of small RNAs from C. elegans embryos(SRP001363)
Sample: Caenorhabditis elegans(SRS006961)
Instrument: Illumina Genome Analyzer II
Total: 1 run, 3M spots, 106.8M bases
Run #1: SRR029428, 2965597 spots, 106761492 bases
Is there a compact way to convert this into a tab-separated table, with one row per chunk? In this case that would give 3 rows.
I tried this, but it doesn't seem to work:
perl -laF/\n/ -000ne"print join chr(9),@F" myfile.txt
Upvotes: 1
Views: 375
Reputation: 27183
Just treat this as a normal parsing problem, and add a little state:
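use strict;
use warnings;

my @records;
my @current_record;
while ( my $line = <> ) {
    chomp $line;    # a bare chomp would act on $_, not on $line
    if ( length $line ) {
        # Store record data
        push @current_record, $line;
    }
    else {
        # Blank line: close the current record and start a new one
        push @records, [@current_record] if @current_record;
        @current_record = ();
    }
}
# Keep the last record when the input does not end with a blank line
push @records, [@current_record] if @current_record;

print join( "\t", @$_ ), "\n" for @records;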
This is untested and I need to go to bed. If it doesn't work, I'll have to look again tomorrow.
Upvotes: 1
Reputation: 342373
If you don't mind awk:
$ awk -vRS= -vFS="\n" '{$1=$1}1' OFS="\t" file
1:SRX000566 Submitter: WoldLab Study: RNASeq expression profiling for ENCODE project(SRP000228) Sample: Human cell line GM12878(SRS000567) Instrument: Solexa 1G Genome Analyzer Total: 4 runs, 62.7M spots, 2.1G bases Run #1: SRR002055, 11373440 spots, 375323520 bases Run #2: SRR002063, 22995209 spots, 758841897 bases Run #3: SRR005091, 13934766 spots, 459847278 bases Run #4: SRR005096, 14370900 spots, 474239700 bases
2:SRX000565 Submitter: WoldLab Study: RNASeq expression profiling for ENCODE project(SRP000228) Sample: Human cell line GM12878(SRS000567) Instrument: Solexa 1G Genome Analyzer Total: 3 runs, 51.2M spots, 1.7G bases Run #1: SRR002052, 12607931 spots, 416061723 bases Run #2: SRR002054, 12880281 spots, 425049273 bases Run #3: SRR002060, 25740337 spots, 849431121 bases
3:SRX012407 Submitter: GEO Study: GSE17153: Illumina sequencing of small RNAs from C. elegans embryos(SRP001363) Sample: Caenorhabditis elegans(SRS006961) Instrument: Illumina Genome Analyzer II Total: 1 run, 3M spots, 106.8M bases Run #1: SRR029428, 2965597 spots, 106761492 bases
Otherwise, here is a Perl equivalent of the awk command above:
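#!/usr/bin/perl
use strict;
use warnings;

$\ = "\n";    # output record separator: end every print with a newline
$/ = "";      # paragraph mode: blank lines separate records, like awk's RS=
while (<>) {
    chomp;    # in paragraph mode this strips all trailing newlines
    my @F = split /\n/, $_;    # one field per line, like FS="\n"
    print join( "\t", @F );
}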
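For anyone puzzled by the awk one-liner in this answer, here is a minimal sketch of what each piece does, run on a made-up two-record sample. The sample lines (a1, b1, ...) and the temporary file are illustrative assumptions, not the real data, and the sketch assumes, as both answers do, that records are separated by a blank line.

```shell
# Build a tiny two-record sample; the records are separated by one blank line.
sample=$(mktemp)
printf 'a1\na2\n\nb1\nb2\nb3\n' > "$sample"

# RS=        : paragraph mode, each blank-line-separated chunk is one record
# FS='\n'    : every line of the chunk becomes one field
# OFS='\t'   : fields are joined with a tab when the record is rebuilt
# $1=$1      : assigning to a field forces awk to rebuild the record with OFS
# trailing 1 : an always-true pattern, so the rebuilt record gets printed
result=$(awk -v RS= -v FS='\n' -v OFS='\t' '{$1=$1}1' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

This should print one tab-separated row per chunk, so two rows for the sample. Without the `$1=$1` assignment awk would print each record unchanged, newlines included.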
Upvotes: 1