Reputation: 5719
I have this data below,called data.txt
, I want to retrieve four columns from this data. First, I want to retrieve degradome category, then p-value, then the text before and after Query:
. So the result should look like this(showing the first row only):
Degardome Category: 3 Degradome p-value: 0.0195958324320822 3' UGACGUUUCAGUUCCCAGUAU 5' Seq_3694_200
data.txt:
5' CCGGUAAGGUUAUGGGUCAUG 3' Transcript: Supercontig_2.8_1446328:1451-1471 Slice Site:1462
|o||o||o| |||||||o
3' UGACGUUUCAGUUCCCAGUAU 5' Query: Seq_3694_200
SiteID: Supercontig_2.8_1446328:1462
MFE of perfect match: -36.10
MFE of this site: -23.60
MFEratio: 0.653739612188366
Allen et al. score: 7.5
Paired Regions (query5'-query3',transcript3'-transcript5')
1-8,1471-1464
10-18,1462-1454
Unpaired Regions (query5'-query3',transcript3'-transcript5')
9-9,1463-1463 SIL: Symmetric internal loop
19-21,1453-1451 UP3: Unpaired region at 3' of query
Degradome data file: /media/owner/newdrive/phasing/degradome/_degradome.20171210/bbduk_trimmed/merged_HV2.fasta_dd.txt
Degardome Category: 3
Degradome p-value: 0.0195958324320822
T-Plot file: T-plots-IGR/Seq_3694_200_Supercontig_2.8_1446328_1462_TPlot.pdf
Position Reads Category
1462 4 3 <<<<<<<<<<
2949 7 3
4179 517 0
---------------------------------------------------
---------------------------------------------------
5' GGUGAGGAGGGGGGUUUG-GUC 3' Transcript: Supercontig_2.8_1511075:1311-1331 Slice Site:1323
| |||||oo||| |||o |||
3' AC-CUCCUUUCCCGAAAUACAG 5' Query: Seq_2299_664
SiteID: Supercontig_2.8_1511075:1323
MFE of perfect match: -37.90
MFE of this site: -25.30
MFEratio: 0.66754617414248
Allen et al. score: 8
Paired Regions (query5'-query3',transcript3'-transcript5')
1-3,1331-1329
5-8,1328-1325
10-19,1323-1314
20-20,1312-1312
Unpaired Regions (query5'-query3',transcript3'-transcript5')
4-4,x-x BULq: Bulge on query side
9-9,1324-1324 SIL: Symmetric internal loop
x-x,1313-1313 BULt: Bulge on transcript side
21-21,1311-1311 UP3: Unpaired region at 3' of query
Degradome data file: /media/owner/newdrive/phasing/degradome/_degradome.20171210/bbduk_trimmed/merged_HV2.fasta_dd.txt
Degardome Category: 4
Degradome p-value: 0.013385336399181
I tried to do this for before and after values, then I keep getting errors. Sorry I am new to perl and would really appreciate your help. Here are some of the codes I tried:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use Modern::Perl;
my word = "Query:";
my $filename = $ARGV[0];
open(INPUT_FILE, $filename);
while (<<>>) {
chomp;
my ($before, $after) = m/(.+)(?:\t\Q$word\E:\t)(.+)/i;
say "word: $word\tbefore: $before\tafter: $after";
}
Upvotes: 2
Views: 78
Reputation: 66873
Since you need straight pieces of data from each section, and both sections and data come clearly demarcated, the only question is of what data structure to use. Given that you want mere lines with values collected from each section a simple array should be fine.
It is known that the phrases of interest, Query:
then Degardome Category: N
then p-value
, are unique to the context and places shown in the sample.
use warnings;
use strict;
use feature 'say';
my $file = shift || die "Usage $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (@res, @query, $category, $pvalue);
while (<$fh>) {
next if not /\S/;
if (/(.*?)\s+Query:\s+(.*)/) {
@query = ($1, $2);
next;
}
if (/^\s*(Degardome Category:\s+[0-9]+)/) {
$category = $1;
}
elsif (/^\s*(Degradome p-value:\s+[0-9.]+)/) {
$pvalue = $1;
push @res, [$category, $pvalue, @query];
}
}
say "@$_" for @res;
The end of a section is detected with the p-value:
line, at which point we add to the @res
an arrayref with all needed values captured up to that point.
The regex throughout depends on properties of data seen in the sample. Please review and adjust if some of my assumptions aren't right.
Details can also be pried from data more precisely, even by simply adding capture groups to the regexes above (and saving those captures into additional data structures).
Upvotes: 3