user630605
user630605

Reputation: 167

Read and parse multiple text files

Can anyone suggest a simple way of achieving this. I have several files which ends with extension .vcf . I will give example with two files In the below files, we are interested in

File 1:

38  107 C   3   T   6   C/T
38  241 C   4   T   5   C/T
38  247 T   4   C   5   T/C
38  259 T   3   C   6   T/C
38  275 G   3   A   5   G/A
38  304 C   4   T   5   C/T
38  323 T   3   A   5   T/A

File2:

38  107 C   8   T   8   C/T
38  222 -   6   A   7   -/A
38  241 C   7   T   10  C/T
38  247 T   7   C   10  T/C
38  259 T   7   C   10  T/C
38  275 G   6   A   11  G/A
38  304 C   5   T   12  C/T
38  323 T   4   A   12  T/A
38  343 G   13  A   5   G/A

Index file :

107
222
241
247
259
275
304
323
343

The index file is created based on unique positions from file 1 and file 2. I have that ready as index file. Now i need to read all files and parse data according to the positions here and write in columns. From above files, we are interested in 4th (Ref) and 6th (Alt) columns. Another challenge is to name the headers accordingly. So the output should be something like this.

Position    File1_Ref   File1_Alt   File2_Ref   File2_Alt
107 3   6   8   8
222         6   7
241 4   5   7   10
247 4   5   7   10
259 3   6   7   10
275 3   5   6   11
304 4   5   5   12
323 3   5   4   12
343         13  5

Upvotes: 0

Views: 995

Answers (2)

dogbane
dogbane

Reputation: 274612

You can do this using the join command:

# add file1
$ join -e' ' -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n index) <(sort -n -k2 file1) > file1.merged

# add file2
$ join -e' ' -1 1 -2 2 -a 1 -o 0,1.2,1.3,2.4,2.6 file1.merged <(sort -n -k2 file2) > file2.merged

# create the header
$ echo "Position File1_Ref File1_Alt File2_Ref File2_Alt" > report
$ cat file2.merged >> report

Output:

$ cat report

Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222     6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
323 4 12 4 12
343 13 5 13 5

Update:

Here is a script which can be used to combine multiple files.

The following assumptions have been made:

  • The index file is sorted
  • The vcf files are sorted on their second column
  • There are no spaces (or any other special characters) in filenames

Save the following script to a file e.g. report.sh and run it without any arguments from the directory containing your files.

#!/bin/bash

INDEX_FILE=index    # the name of the file containing the index data
REPORT_FILE=report  # the file to write the report to
TMP_FILE=$(mktemp)  # a temporary file

header="Position"   # the report header
num_processed=0     # the number of files processed so far 

# loop over all files beginning with "file". 
# this pattern can be changed to something else e.g. *.vcf
for file in file*
do
    echo "Processing $file"
    if [[ $num_processed -eq 0 ]]
    then
        # it's the first file so use the INDEX file in the join
        join -e' ' -t, -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n "$INDEX_FILE") <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
    else
        # work out the output fields
        for ((outputFields="0",j=2; j < $((2 + $num_processed * 2)); j++))
        do
            outputFields="$outputFields,1.$j"
        done
        outputFields="$outputFields,2.4,2.6"

        # join this file with the current report
        join -e' ' -t, -1 1 -2 2 -a 1 -o "$outputFields" "$REPORT_FILE" <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
    fi
    ((num_processed++))
    header="$header,File${num_processed}_Ref,File${num_processed}_Alt"
    mv "$TMP_FILE" "$REPORT_FILE"
done

# add the header to the report
echo "$header" | cat - "$REPORT_FILE"  > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"

# the report is a csv file. Uncomment the line below to make it space-separated.
# tr ',' ' ' < "$REPORT_FILE"  > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"

Upvotes: 3

Chris Charley
Chris Charley

Reputation: 6588

This Perl solution will handle 1 or more, (50), files.

#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp qw/ slurp /;
use Text::Table;

my $path = '.';
my @file = qw/ o33.txt o44.txt /;
my @position = slurp('index.txt') =~ /\d+/g;
my %data;

for my $filename (@file) {
    open my $fh, "$path/$filename" or die "Can't open $filename $!";
    while (<$fh>) {
        my ($pos, $ref, $alt) = (split)[1, 3, 5];
        $data{$pos}{$filename} = [$ref, $alt];
    }
    close $fh or die "Can't close $filename $!";
}

my @head;
for my $file (@file) {
    push @head, "${file}_Ref", "${file}_Alt";
}

my $tb = Text::Table->new( map {title => $_}, "Position", @head);

for my $pos (@position) {
    $tb->load( [
                $pos,
                map $data{$pos}{$_} ? @{ $data{$pos}{$_} } : ('', ''), @file
               ]
    );
}
print $tb;

Upvotes: 0

Related Questions