heinheo

Reputation: 565

split file into several sub files

The file I am working on looks like this

header
//
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
absolute:
gthcont: 5 4 2 1 3 4 543 5  67 657  78 67 8  5645 6 
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111

I need it to be split into four files. The first file is

[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);

The second file has to be

5 4 2 1 3 4 543 5  67 657  78 67 8  5645 6

The next file has to be

01010010101010101010101010101011111100011
11110100100101010101010101111010001000001
00000000000000011001100101010010101011111

So the header and the // line have to be excluded from the first file, the absolute: line should be removed, and the gthcont: prefix should not appear either. Ideally the script would take the name of the input file and name the outputs first_input, second_input, third_input and so on.

The fourth file should contain the numbers from within the brackets in the first file. In this case it would only be

25
29

My current attempt is

awk.awk

BEGIN{body=0}
!body && /^\/\/$/    {body=1}
body  && /^\[/       {print > "first_"FILENAME}
body  && /^pos/{$1="";print > "second_"FILENAME}
body  && /^[01]+/    {print > "third_"FILENAME}
body  && /^\[[0-9]+\]/ {
  print > "first_"FILENAME
  print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}

but it somehow duplicates the lines in the first file, so it contains [25], [25], [29], [29].

Upvotes: 2

Views: 97

Answers (2)

Tom Fenech

Reputation: 74705

Some very minor changes to your script produce the desired output:

!body && /^\/\/$/              {body=1}
body  && sub(/^gthcont: */,"") {print > "second_"FILENAME}
body  && /^[01]+/              {print > "third_"FILENAME}
body  && /^\[[0-9]+\]/ {
    print > "first_"FILENAME
    print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}

The duplication problem was caused by the fact that you printed to the first file in two places.
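
To illustrate why that happens, here is a minimal sketch. The helper name dup_demo, the rule labels and the one-line sample input are made up for the demonstration and are not part of the original script. awk tests every pattern against every record, so a line that matches two patterns runs both actions:

```shell
# dup_demo is a hypothetical helper name used only for this illustration.
dup_demo() {
    awk '
        /^\[/         { print "rule A: " $0 }   # matches any line starting with "["
        /^\[[0-9]+\]/ { print "rule B: " $0 }   # also matches "[25]:..."
    '
}

printf '%s\n' '[25]:0.1' | dup_demo
```

The single input line is printed twice, once by each rule, which is the same effect as the two rules that both wrote to "first_"FILENAME.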

I have used sub to remove the first part of the gthcont: line (and changed the pattern too). sub returns true if it makes any replacements, so you can use it as a test as well. The advantage of using a substitution rather than unsetting the first field is that you can also get rid of the leading white space from the line.
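
As a rough comparison of the two approaches, the sketch below runs both on the same line. The helper names clear_field and strip_prefix and the sample line are assumptions chosen for the example:

```shell
# Hypothetical helpers for comparing the two ways of dropping the "gthcont:" label.

# Emptying $1 makes awk rebuild the record from its fields joined by OFS,
# so a leading space remains and runs of spaces collapse to one.
clear_field() {
    awk '{ $1 = ""; print }'
}

# sub() edits $0 in place and returns the number of substitutions made,
# so it doubles as the pattern: non-matching lines are skipped.
strip_prefix() {
    awk 'sub(/^gthcont: */, "") { print }'
}

printf '%s\n' 'gthcont: 5 4  2' | clear_field
printf '%s\n' 'gthcont: 5 4  2' | strip_prefix
printf '%s\n' 'absolute:'       | strip_prefix   # prints nothing
```

The first call prints " 5 4 2" with a leading space, while the second prints "5 4  2" with the original spacing intact.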

As pointed out in the comments, there is no need to initialise body, so I removed the BEGIN block too.
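
A small sketch of why the initialisation is unnecessary: an awk variable that has never been assigned evaluates to zero (or the empty string), so !body is already true on the first line. The helper name after_marker and the sample input are hypothetical, and the next statement is an addition for this demo only, so that the // line itself is not printed:

```shell
# after_marker is a hypothetical helper; it prints every line after the first "//" line.
after_marker() {
    awk '
        !body && /^\/\/$/ { body = 1; next }   # body is unset at first, so !body is true
        body              { print }            # once body is 1, print each record
    '
}

printf '%s\n' 'header' '//' 'data 1' 'data 2' | after_marker
```

Only "data 1" and "data 2" are printed, without any BEGIN block.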

Upvotes: 2

bgoldst

Reputation: 35324

I would just use a shell function for this:

function split3 {
    if [[ $# -ne 1 ]]; then echo 'split3: error: require 1 argument.' >&2; return 1; fi;
    while read -r; do
        line=$REPLY;
        if [[ "$line" =~ ^\[([0-9]+)\]: ]]; then
            echo "$line" >&3;
            echo "${BASH_REMATCH[1]}" >&6;
        elif [[ "$line" =~ ^gthcont: ]]; then
            echo "${line#gthcont: }" >&4;
        elif [[ "$line" =~ ^[[:space:]]*[01]+[[:space:]]*$ ]]; then
            echo "$line" >&5;
        fi;
    done <"$1" 3>"first_$1" 4>"second_$1" 5>"third_$1" 6>"fourth_$1";
};
split3 input; echo $?;
## 0
cat first_input;
## [25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
## [29]:((962:0.000580339,930:0.000580339):0.00543993);
cat second_input;
## 5 4 2 1 3 4 543 5  67 657  78 67 8  5645 6
cat third_input;
## 01010010101010101010101010101011111100011
## 1111010010010101010101010111101000100000
## 00000000000000011001100101010010101011111
cat fourth_input;
## 25
## 29

Upvotes: 1
