Achal Neupane

Reputation: 5719

How do I retrieve values from successive lines in Perl?

I have the data below, in a file called data.txt, and I want to retrieve four columns from it: first the degradome category, then the p-value, then the text before and the text after Query:. The result should look like this (showing the first row only):

Degardome Category: 3  Degradome p-value: 0.0195958324320822   3' UGACGUUUCAGUUCCCAGUAU 5' Seq_3694_200

data.txt:

5' CCGGUAAGGUUAUGGGUCAUG 3' Transcript: Supercontig_2.8_1446328:1451-1471 Slice Site:1462
      |o||o||o| |||||||o
3' UGACGUUUCAGUUCCCAGUAU 5' Query: Seq_3694_200

SiteID: Supercontig_2.8_1446328:1462
MFE of perfect match: -36.10
MFE of this site: -23.60
MFEratio: 0.653739612188366
Allen et al. score: 7.5
Paired Regions (query5'-query3',transcript3'-transcript5')
    1-8,1471-1464
    10-18,1462-1454
Unpaired Regions (query5'-query3',transcript3'-transcript5')
    9-9,1463-1463   SIL: Symmetric internal loop
    19-21,1453-1451 UP3: Unpaired region at 3' of query

Degradome data file: /media/owner/newdrive/phasing/degradome/_degradome.20171210/bbduk_trimmed/merged_HV2.fasta_dd.txt
Degardome Category: 3
Degradome p-value: 0.0195958324320822
T-Plot file: T-plots-IGR/Seq_3694_200_Supercontig_2.8_1446328_1462_TPlot.pdf

Position    Reads   Category
1462    4   3   <<<<<<<<<<
2949    7   3
4179    517 0
---------------------------------------------------
---------------------------------------------------

5' GGUGAGGAGGGGGGUUUG-GUC 3' Transcript: Supercontig_2.8_1511075:1311-1331 Slice Site:1323
    | |||||oo||| |||o |||
3' AC-CUCCUUUCCCGAAAUACAG 5' Query: Seq_2299_664

SiteID: Supercontig_2.8_1511075:1323
MFE of perfect match: -37.90
MFE of this site: -25.30
MFEratio: 0.66754617414248
Allen et al. score: 8
Paired Regions (query5'-query3',transcript3'-transcript5')
    1-3,1331-1329
    5-8,1328-1325
    10-19,1323-1314
    20-20,1312-1312
Unpaired Regions (query5'-query3',transcript3'-transcript5')
    4-4,x-x BULq: Bulge on query side
    9-9,1324-1324   SIL: Symmetric internal loop
    x-x,1313-1313   BULt: Bulge on transcript side
    21-21,1311-1311 UP3: Unpaired region at 3' of query

Degradome data file: /media/owner/newdrive/phasing/degradome/_degradome.20171210/bbduk_trimmed/merged_HV2.fasta_dd.txt
Degardome Category: 4
Degradome p-value: 0.013385336399181

I tried to capture the text before and after Query:, but I keep getting errors. I am new to Perl and would really appreciate your help. Here is some of the code I tried:

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use Modern::Perl;

my word = "Query:";

my $filename = $ARGV[0];
open(INPUT_FILE, $filename);
while (<<>>) {
chomp;
my ($before, $after) = m/(.+)(?:\t\Q$word\E:\t)(.+)/i;
say "word: $word\tbefore: $before\tafter: $after";
}

Upvotes: 2

Views: 78

Answers (1)

zdim

Reputation: 66873

Since you need specific pieces of data from each section, and both the sections and the data are clearly demarcated, the only question is what data structure to use. Given that you just want one line per section with the collected values, a simple array should be fine.

This assumes that the phrases of interest (Query:, then Degardome Category: N, then Degradome p-value:) are unique to the context and places shown in the sample.

use warnings;
use strict;
use feature 'say';

my $file = shift || die "Usage: $0 file\n";

open my $fh, '<', $file  or die "Can't open $file: $!";

my (@res, @query, $category, $pvalue);

while (<$fh>) {
    next if not /\S/;    # skip blank lines

    # Text before and after "Query:"
    if (/(.*?)\s+Query:\s+(.*)/) {
        @query = ($1, $2);
        next;
    }

    if (/^\s*(Degardome Category:\s+[0-9]+)/) {    # spelled as in the data
        $category = $1;
    }
    elsif (/^\s*(Degradome p-value:\s+[0-9.]+)/) {
        # The p-value line ends a section: store everything collected so far
        $pvalue = $1;
        push @res, [$category, $pvalue, @query];
    }
}

say "@$_" for @res;

The end of a section is detected by the Degradome p-value: line, at which point an arrayref with all the values captured so far is added to @res.

The regexes depend on properties of the data seen in the sample. Please review and adjust them if any of my assumptions aren't right.
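As an illustration of such an adjustment, here is a minimal sketch of looser patterns. It assumes two things the sample does not show: that some files might spell the label as "Degradome Category:", and that very small p-values might be written in scientific notation (for example 1.2e-05). The names $category_re and $pvalue_re and the test lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical looser patterns; the sample data only shows the first form of each.
my $category_re = qr/^\s*Deg(?:ar|ra)dome Category:\s+([0-9]+)/;
my $pvalue_re   = qr/^\s*Degradome p-value:\s+([0-9.]+(?:[eE][-+]?[0-9]+)?)/;

# Made-up lines: the second and fourth are NOT from the sample data.
for my $line (
    "Degardome Category: 3",
    "Degradome Category: 3",
    "Degradome p-value: 0.0195958324320822",
    "Degradome p-value: 1.2e-05",
) {
    if    ($line =~ $category_re) { print "category=$1\n" }
    elsif ($line =~ $pvalue_re)   { print "pvalue=$1\n" }
}
```

With the original [0-9.]+ pattern, a value such as 1.2e-05 would be cut off at 1.2, so the optional exponent part is worth adding if your files can contain it.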

More precise details can also be extracted from the data simply by adding capture groups to the regexes above and saving those captures into additional data structures.
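One way this could look is sketched below: the capture groups take only the bare values, and each section becomes a hashref. The sub name parse_sections and the hash keys are hypothetical choices, and an in-memory filehandle stands in for data.txt so the snippet runs on its own.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper (not part of the code above): reads sections from a
# filehandle and returns one hashref per section.
sub parse_sections {
    my ($fh) = @_;
    my (@records, %rec);

    while (my $line = <$fh>) {
        if ($line =~ /^\s*(.*?)\s+Query:\s+(\S+)/) {
            # Alignment string before "Query:", sequence ID after it
            @rec{qw(alignment query)} = ($1, $2);
        }
        elsif ($line =~ /^\s*Degardome Category:\s+([0-9]+)/) {
            # Spelling "Degardome" follows the sample data
            $rec{category} = $1;
        }
        elsif ($line =~ /^\s*Degradome p-value:\s+([0-9.]+)/) {
            # The p-value line closes a section: store a copy and reset
            $rec{pvalue} = $1;
            push @records, { %rec };
            %rec = ();
        }
    }
    return @records;
}

# Usage: lines taken from the first section of the sample
my $sample = <<'END';
3' UGACGUUUCAGUUCCCAGUAU 5' Query: Seq_3694_200
Degardome Category: 3
Degradome p-value: 0.0195958324320822
END

open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
for my $r (parse_sections($fh)) {
    say join "\t", @$r{qw(category pvalue alignment query)};
}
```

Keeping named fields means later steps (sorting by p-value, filtering by category) can refer to $r->{pvalue} or $r->{category} instead of array positions.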

Upvotes: 3
