Reputation: 565
The file I am working on looks like this:
header
//
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
absolute:
gthcont: 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
I need it to be split into four files. The first file is
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
The second file has to be
5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
The next file has to be
01010010101010101010101010101011111100011
11110100100101010101010101111010001000001
00000000000000011001100101010010101011111
So the header and the // line have to be excluded before the first file, the absolute: line should be removed, and the gthcont: prefix should not appear either. Ideally the script would take the name of the input file and name the outputs first_input, second_input, third_input and so on.
The fourth file should contain the numbers from within the brackets in the first file. In this case it would only be
25
29
So my current attempt is
BEGIN{body=0}
!body && /^\/\/$/ {body=1}
body && /^\[/ {print > "first_"FILENAME}
body && /^pos/{$1="";print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {
print > "first_"FILENAME
print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
but it somehow duplicates the lines in the first file, so it contains [25], [25], [29], [29].
Upvotes: 2
Views: 97
Reputation: 74705
Some very minor changes to your script produce the desired output:
!body && /^\/\/$/ {body=1}
body && sub(/^gthcont: */,"") {print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {
    print > "first_"FILENAME
    print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
The duplication was caused by printing to the first file in two places: the /^\[/ rule and the /^\[[0-9]+\]/ rule both match those lines, and awk runs every rule that matches.
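To see the effect in isolation, here is a minimal sketch with a made-up one-line input. The labels rule1 and rule2 are only for illustration and are not part of the original script:

```shell
# Both patterns match the same line, so both actions run
# and the line is printed twice.
out=$(printf '[25]:0.1;\n' | awk '
  /^\[/         { print "rule1: " $0 }
  /^\[[0-9]+\]/ { print "rule2: " $0 }
')
printf '%s\n' "$out"
```

Dropping one of the two rules, as in the script above, leaves a single print per line.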
I have used sub to remove the first part of the gthcont: line (and changed the pattern too). sub returns the number of substitutions it made, which is non-zero (true) whenever it replaces something, so you can use it as a test as well. The advantage of using a substitution rather than emptying the first field is that you also get rid of the leading white space on the line.
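A small sketch of the difference between the two approaches, using a made-up two-line input (the variable names are hypothetical):

```shell
input='gthcont: 5 4 2
header'

# sub() as the pattern: only the line where a substitution happened is
# printed, with the prefix and the spaces after it already removed.
with_sub=$(printf '%s\n' "$input" | awk 'sub(/^gthcont: */, "") { print }')

# Emptying $1 makes awk rebuild the record with OFS between the fields,
# which leaves a leading space behind.
with_field=$(printf '%s\n' "$input" | awk '/^gthcont:/ { $1 = ""; print }')

# Brackets make the leading space visible.
printf '[%s]\n[%s]\n' "$with_sub" "$with_field"
```

The second bracketed line starts with a space, the first does not.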
As pointed out in the comments, there is no need to initialise body, so I removed the BEGIN block too.
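For completeness, a sketch of an end-to-end run on the sample from the question. The scratch directory and the file name input are assumptions for the demo. The redirection targets are wrapped in parentheses here, a small change from the script above, because some awk implementations parse an unparenthesised concatenation after > differently:

```shell
# Work in a scratch directory so the generated files stay out of the way.
cd "$(mktemp -d)" || exit 1

# Sample data copied from the question.
cat > input <<'EOF'
header
//
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
absolute:
gthcont: 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
EOF

awk '
  # body is unset (false) until the // separator line has been seen
  !body && /^\/\/$/              { body = 1 }
  body && sub(/^gthcont: */, "") { print > ("second_" FILENAME) }
  body && /^[01]+/               { print > ("third_" FILENAME) }
  body && /^\[[0-9]+\]/ {
      print > ("first_" FILENAME)
      print substr($0, 2, index($0, "]") - 2) > ("fourth_" FILENAME)
  }
' input

cat fourth_input   # prints 25 and 29, one per line
```

Because the output names are built from FILENAME, running the same command on another file produces first_, second_, third_ and fourth_ files named after that file.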
Upvotes: 2
Reputation: 35324
I would just use a shell function for this:
function split3 {
    if [[ $# -ne 1 ]]; then echo 'split3: error: require 1 argument.' >&2; return 1; fi;
    while read -r; do
        line=$REPLY;
        if [[ "$line" =~ ^\[([0-9]+)\]: ]]; then
            # bracketed line: whole line to fd 3, captured number to fd 6
            echo "$line" >&3;
            echo "${BASH_REMATCH[1]}" >&6;
        elif [[ "$line" =~ ^gthcont: ]]; then
            # strip the "gthcont: " prefix
            echo "${line#gthcont: }" >&4;
        elif [[ "$line" =~ ^[[:space:]]*[01]+[[:space:]]*$ ]]; then
            # POSIX class instead of \s, which not every regex library supports
            echo "$line" >&5;
        fi;
    done <"$1" 3>"first_$1" 4>"second_$1" 5>"third_$1" 6>"fourth_$1";
};
split3 input; echo $?;
## 0
cat first_input;
## [25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
## [29]:((962:0.000580339,930:0.000580339):0.00543993);
cat second_input;
## 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
cat third_input;
## 01010010101010101010101010101011111100011
## 1111010010010101010101010111101000100000
## 00000000000000011001100101010010101011111
cat fourth_input;
## 25
## 29
Upvotes: 1