Reputation: 7255
File in:
1 34566 34765
2 45678 45789
Scaffold_3 34567 34799
Scaffold_X 67895 66900
Scaffold_Y 34567 34890
Note: The file has many lines. I want to remove only the underscore (_) from the words in the first column; nothing else should change. I am learning sed and awk, so commands using these tools would be helpful, ideally with an explanation.
File out:
1 34566 34765
2 45678 45789
Scaffold3 34567 34799
ScaffoldX 67895 66900
ScaffoldY 34567 34890
Upvotes: 2
Views: 1271
Reputation: 52361
I've modified your input file a little to demonstrate that only the underscore in the first column is removed:
1 34_566 34765
2 45678 45_789
Scaffold_3 345_67 34799
Scaffold_X 678_95 66900
Scaffold_Y 345_67 34890
As for removing the underscore, I've used sed:
$ sed 's/^\([^ _]*\)_/\1/' infile
1 34_566 34765
2 45678 45_789
Scaffold3 345_67 34799
ScaffoldX 678_95 66900
ScaffoldY 345_67 34890
The command uses a substitution. We match all the characters that are neither spaces nor underscores and capture them: \([^ _]*\). This expression is anchored at the start of the line (the first ^) and is followed by an underscore.
We then replace the match with what we have captured, leaving the underscore out (the \1 backreference in the replacement string).
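For comparison, the same substitution can be written with extended regular expressions, where the capture group needs no backslashes. This is only a sketch: it assumes a sed that accepts -E (GNU sed and BSD sed both do), and the two sample lines are taken from the modified input above.

```shell
# Same substitution in ERE syntax; assumes sed supports -E.
# The first line has no underscore in column 1, so it must stay unchanged.
result=$(printf '1 34_566 34765\nScaffold_3 345_67 34799\n' |
  sed -E 's/^([^ _]*)_/\1/')
printf '%s\n' "$result"
```

On the first line, [^ _]* can only match "1", which is followed by a space rather than an underscore, so the pattern fails and the line passes through untouched.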
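Should there be more than one underscore in the first column, it gets a little tricky with sed. There are basically two options: repeat the substitution in a loop until no underscore is left in the first column, or isolate the first column, delete all underscores from it, and put it back in front of the rest of the line.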
Here is an implementation of the first approach:
sed '
:a # Label to jump to
s/^\([^ _]*\)_/\1/ # Replace underscore in first column (like above)
ta # Jump to label if something was changed
' infile
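The loop can also be written with one -e option per command. This is a hedged sketch: it assumes that some sed implementations read a label up to the end of the line, so keeping :a in its own -e argument avoids relying on how a trailing comment or semicolon is parsed. The sample line is hypothetical.

```shell
# Loop until no underscore is left in the first column.
# Each command gets its own -e so the label "a" ends at the argument boundary.
result=$(printf 'Sca_ffold_Y 345_67 34890\n' |
  sed -e ':a' -e 's/^\([^ _]*\)_/\1/' -e 'ta')
printf '%s\n' "$result"
```

Each pass removes the leftmost underscore of the first column. Once the first column is clean, the substitution fails, ta no longer jumps, and the line is printed.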
And this is an implementation of the second approach:
sed '
h # Copy pattern space to hold space
s/^\([^ ]*\).*/\1/ # Remove everything but the first column
s/_//g # Delete all underscores
G # Append hold space to pattern space
# Replace old first column with underscore-free first column
s/^\(.*\)\n[^ ]*\(.*\)/\1\2/
' infile
The last step is the trickiest one. Before it, our pattern space looks like this (assuming an input file with multiple underscores in the first column):
ScaffoldY\nSca_ffold_Y 345_67 34890$
^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^
New col 1     Old complete line
We replace the old first column with the new first column by smartly capturing and replacing:
ScaffoldY\nSca_ffold_Y 345_67 34890$
^^^^^^^^^             ^^^^^^^^^^^^^
   \1                      \2
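The intermediate pattern space shown in these diagrams can be inspected directly. As a small sketch, the script below stops before the final substitution and uses sed's l command, which prints the pattern space unambiguously (an embedded newline as \n and a $ at the end of the line); -n suppresses the normal output. The sample line is the one from the diagram.

```shell
# Run every step except the last substitution, then dump the pattern space with l.
state=$(printf 'Sca_ffold_Y 345_67 34890\n' |
  sed -n 'h;s/^\([^ ]*\).*/\1/;s/_//g;G;l')
printf '%s\n' "$state"
```

Inserting l after any other step in the same way shows how the pattern space changes from command to command.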
So for an input file that looks like
1 34_566 34765
2 45678 45_789
Sca_ffold_3 345_67 34799
Sca_ffold_X 678_95 66900
Sca_ffold_Y 345_67 34890
we get output like this (with the command compressed to a single line):
$ sed 'h;s/^\([^ ]*\).*/\1/;s/_//g;G;s/^\(.*\)\n[^ ]*\(.*\)/\1\2/' infile
1 34_566 34765
2 45678 45_789
Scaffold3 345_67 34799
ScaffoldX 678_95 66900
ScaffoldY 345_67 34890
Note that none of this works if the input file is not space separated. The spaces in the bracket expressions have to be changed to reflect, e.g., tab separation; [:blank:] matches both spaces and tabs. The single-underscore command becomes
sed 's/^\([^[:blank:]_]*\)_/\1/' infile
the loop version becomes
sed ':a;s/^\([^[:blank:]_]*\)_/\1/;ta' infile
and the hold space version becomes
sed 'h;s/^\([^[:blank:]]*\).*/\1/;s/_//g;G;s/^\(.*\)\n[^[:blank:]]*\(.*\)/\1\2/' infile
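To see why the change matters, here is a sketch that feeds one tab-separated line (a made-up sample) to both the space-only and the [:blank:] variants of the hold space command. The variable names are only for illustration.

```shell
# A tab-separated sample line with underscores in the first two columns.
line=$(printf 'Sca_ffold_3\t345_67\t34799')

# Space-only version: the line contains no space, so [^ ]* spans the
# whole line and underscores are deleted from every column.
space_only=$(printf '%s\n' "$line" |
  sed 'h;s/^\([^ ]*\).*/\1/;s/_//g;G;s/^\(.*\)\n[^ ]*\(.*\)/\1\2/')

# [:blank:] version: the first column ends at the first tab.
blank_aware=$(printf '%s\n' "$line" |
  sed 'h;s/^\([^[:blank:]]*\).*/\1/;s/_//g;G;s/^\(.*\)\n[^[:blank:]]*\(.*\)/\1\2/')

printf '%s\n' "$space_only" "$blank_aware"
```

The space-only version silently strips the underscore from 345_67 as well, while the [:blank:] version leaves the second column alone.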
Upvotes: 2
Reputation: 3
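Use the underscore as your field separator (-F) instead of the default whitespace. This only works when each line contains at most one underscore and that underscore is in the first column, as in the sample input; an underscore in another column is removed too, and everything after a second underscore is dropped: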
awk -F'_' '{print $1$2}' file.txt
Upvotes: 0
Reputation: 3065
This awk one-liner should do the job:
awk '{gsub(/_/,"",$1)}1' input.txt
Output:
1 34566 34765
2 45678 45789
Scaffold3 34567 34799
ScaffoldX 67895 66900
ScaffoldY 34567 34890
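One caveat: assigning to $1 makes awk rebuild the line with the output field separator, so tabs or runs of spaces between columns come out as single spaces. If the original separators must survive, a possible variant is to cut the first column off the raw line with match() and substr() instead. This is a sketch, not part of the one-liner above; strip_first_col is a hypothetical helper name and the sample line is made up.

```shell
# Remove underscores from the first column only, leaving separators untouched.
strip_first_col() {
  awk '{
    if (match($0, /^[^ \t]+/)) {        # first column: leading run of non-blanks
      first = substr($0, 1, RLENGTH)
      rest  = substr($0, RLENGTH + 1)   # separators and remaining columns, as is
      gsub(/_/, "", first)
      print first rest
    } else {
      print                             # empty line or leading blank: leave alone
    }
  }'
}

result=$(printf 'Sca_ffold_Y\t345_67   34890\n' | strip_first_col)
printf '%s\n' "$result"
```

Because $0 is never reassembled from its fields, the tab and the three spaces in the sample line are preserved.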
Upvotes: 7