Reputation: 149
I have a large, tab-delimited, two-column file that holds the coordinates of many biochemical pathways, like this:
A B
B D
D F
F G
G I
A C
C P
P R
A M
M L
L X
I want to combine lines whenever column 1 of one line is equal to column 2 of another line, which should give the following output:
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
I would like to use something simple, such as an awk one-liner. Does anyone have an idea how I could approach this without writing a shell script? Any help is appreciated. I am trying to get each step, and every subsequent step, in each pathway. Because these pathways often intersect, some steps are shared with other pathways, but I want to analyse each pathway separately.
I have tried a shell script that takes the first line and greps out any line in the file whose $1 equals that line's $2:
while [ -s test ]; do
grep -m1 "^" test > i
cut -f2 i | sed 's/^/"/' | sed 's/$/"/' | sed "s/^/awk \'\$1 == /" | sed "s/$/' test >> i/" > i.sh
sh i.sh
perl -p -e 's/\n/\t/g' i >> OUT
sed '1d' test > i ; mv i test
done
I know that my problem comes from (a) deleting the line and (b) the fact that there are duplicates. I am just not sure how to tackle this.
Upvotes: 3
Views: 415
Reputation: 16997
Input
$ cat f
A B
B D
D F
F G
G I
A C
C P
P R
A M
M L
L X
Output
$ awk '{
    for(j=1; j<=NF; j+=2)
    {
        for(i=j;i<=NF;i+=2)
        {
            printf("%s%s", i==j ? $i OFS : OFS,$(i+1));
            if($(i+1)!=$(i+2)){ print ""; break }
        }
    }
}' RS= OFS="\t" f
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
One liner
awk '{ for(j=1; j<=NF; j+=2)for(i=j;i<=NF;i+=2){printf("%s%s", i==j ? $i OFS : OFS,$(i+1)); if($(i+1)!=$(i+2)){ print ""; break }}}' RS= OFS="\t" f
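With `RS=` awk reads in paragraph mode, so a file without blank lines is loaded as one single record. If that is a concern for a large file, the same idea can be streamed line by line. The following is only a sketch: the function name `chains` is made up for illustration, and it assumes (as in the sample) that a pathway is a run of consecutive lines in which column 1 of each line equals column 2 of the line before it.

```shell
# chains: hypothetical helper, reads tab- or space-separated pairs on stdin
chains() {
  awk -v OFS='\t' '
    # Print every suffix (two nodes or more) of the current chain, then reset it.
    function flush(   s, k, line) {
      for (s = 1; s < n; s++) {
        line = node[s]
        for (k = s + 1; k <= n; k++) line = line OFS node[k]
        print line
      }
      n = 0
    }
    NF < 2 { next }                                  # skip blank or short lines
    n && $1 == node[n] { node[++n] = $2; next }      # line extends the current chain
    { flush(); node[1] = $1; node[2] = $2; n = 2 }   # line starts a new chain
    END { flush() }
  '
}
```

Usage would be `chains < f`. Only the current chain is held in memory, at the cost of rebuilding each output line from the stored nodes.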
Upvotes: 3
Reputation: 44921
$ <f.txt tac | awk 'BEGIN{OFS="\t"}{if($2==c1){$2=$2"\t"c2};print $1,$2;c1=$1;c2=$2}' | tac
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
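Where `tac` is not installed, the same bottom-up pass can be done inside a single awk process. This is a sketch rather than a drop-in replacement: the function name `suffix_chains` and the array names are invented here, and it keeps the whole file in memory, which the `tac` pipeline above does not need to do inside awk.

```shell
# suffix_chains: hypothetical helper, reads two-column pairs on stdin
suffix_chains() {
  awk -v OFS='\t' '
    NF >= 2 { n++; from[n] = $1; to[n] = $2 }   # remember both columns per line
    END {
      # Walk from the last line upwards, as tac would feed them to awk.
      for (i = n; i >= 1; i--) {
        rest[i] = to[i]
        # If the next line starts where this one ends, inherit its tail.
        if (i < n && to[i] == from[i + 1]) rest[i] = to[i] OFS rest[i + 1]
      }
      # Print in the original order, which replaces the second tac.
      for (i = 1; i <= n; i++) print from[i], rest[i]
    }
  '
}
```

It would be called as `suffix_chains < f.txt`.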
Upvotes: 0
Reputation: 930
Well, you could put this on one line, but I wouldn't recommend it :)
#!/usr/bin/awk -f
{
    a[NR] = $0
    for(i = 1; i < NR; i++){
        if(a[i] ~ $1"$")
            a[i] = a[i] FS $2
        if(a[i] ~ "^"$1){
            for(j = i; j < NR; j++){
                print a[j]
                delete a[j]
            }
        }
    }
}
END{
    for(i = 1; i <= NR; i++)
        if(a[i] != "")
            print a[i]
}
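One thing to watch with the `~` tests in this script: they are regular-expression matches against the start or end of the stored line, not comparisons of whole fields. With single-letter names as in the sample that makes no difference, but if the real identifiers are longer (an assumption about the actual data), a name such as `B` would also match a line ending in `AB`. The small demo below, with a made-up function name and made-up values, contrasts the two kinds of test.

```shell
# ends_with_demo: hypothetical helper showing regex match vs. exact field match
ends_with_demo() {
  printf 'X\tAB\n' | awk -v key='B' '{
    regex_hit = ($0 ~ key "$")   # regex: does the line end with "B"?
    exact_hit = ($NF == key)     # exact: is the last field equal to "B"?
    print regex_hit, exact_hit
  }'
}
```

Running `ends_with_demo` prints `1 0`: the regex test matches while the exact comparison does not. Comparing fields with `==` also sidesteps identifiers that happen to contain regex metacharacters.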
Upvotes: 0