everestial

Reputation: 7255

How to remove a specific character only from the first word (first column) of every line?

File in:

1 34566 34765 
2 45678 45789
Scaffold_3 34567 34799
Scaffold_X 67895 66900
Scaffold_Y 34567 34890

Note: The file has many lines. I want to remove only the underscore (_) from the words in the first column; nothing else should change. I am learning sed and awk, so commands using these tools would be helpful, and an explanation would be appreciated as well.

File out:

1 34566 34765 
2 45678 45789
Scaffold3 34567 34799
ScaffoldX 67895 66900
ScaffoldY 34567 34890

Upvotes: 2

Views: 1271

Answers (3)

Benjamin W.

Reputation: 52361

I've modified your input file a little to demonstrate that only the underscore in the first column is removed:

1 34_566 34765
2 45678 45_789
Scaffold_3 345_67 34799
Scaffold_X 678_95 66900
Scaffold_Y 345_67 34890

As for removing the underscore, I've used sed:

$ sed 's/^\([^ _]*\)_/\1/' infile 
1 34_566 34765
2 45678 45_789
Scaffold3 345_67 34799
ScaffoldX 678_95 66900
ScaffoldY 345_67 34890

The command uses a substitution. We match as many characters as possible that are neither spaces nor underscores and capture them: \([^ _]*\). This expression is anchored at the start of the line (the first ^) and must be followed by an underscore.

We then replace the whole match with what we captured, which leaves the underscore out (the \1 backreference in the replacement string).
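The same substitution can also be written with extended regular expressions, which drop the backslashes around the group. This is only a sketch, assuming a sed that accepts -E (GNU sed and BSD/macOS sed both do); the sample line is one from the modified input above:

```shell
# Same substitution in ERE syntax: ( ) instead of \( \).
# Assumption: this sed supports -E (GNU sed, BSD/macOS sed).
result=$(printf '%s\n' 'Scaffold_3 345_67 34799' | sed -E 's/^([^ _]*)_/\1/')
printf '%s\n' "$result"
```

Only the underscore in the first column is removed; the one in 345_67 is untouched because the match is anchored at the start of the line and cannot cross a space.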


Multiple underscores in the first column

Should there be more than one underscore in the first column, it gets a little tricky with sed. There are basically two options:

  1. Try to replace an underscore in the first column (like above), repeat this action until no more changes take place so we know all the underscores in the first column are gone.
  2. Keep only the first column in the pattern space, replace all underscores globally, get the whole line back and replace the old with the new first column.

Here is an implementation of the first approach:

sed '
:a                  # Label to jump to
s/^\([^ _]*\)_/\1/  # Replace underscore in first column (like above)
ta                  # Jump to label if something was changed
' infile

And this is an implementation of the second approach:

sed '
h                    # Copy pattern space to hold space
s/^\([^ ]*\).*/\1/   # Remove everything but the first column
s/_//g               # Delete all underscores
G                    # Append hold space to pattern space

# Replace old first column with underscore-free first column
s/^\(.*\)\n[^ ]*\(.*\)/\1\2/
' infile

The last step is the trickiest one. Before it, our pattern space looks like this (assuming an input file with multiple underscores in the first column):

ScaffoldY\nSca_ffold_Y 345_67 34890$
^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^
New col 1      Old complete line

We replace the old first column with the new first column by smartly capturing and replacing:

ScaffoldY\nSca_ffold_Y 345_67 34890$
^^^^^^^^^             ^^^^^^^^^^^^^
    \1                      \2
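To look at this intermediate pattern space yourself, one option is sed's l command, which prints the pattern space in an unambiguous form (an embedded newline shows as \n, the end as $). A small sketch that stops after G instead of doing the final substitution; -n is added so that only the output of l appears, and the sample line is the one used in the diagrams above:

```shell
# Run the first four steps, then print the pattern space with "l"
# rather than performing the last substitution.
# \n marks the embedded newline, $ marks the end of the pattern space.
pattern_space=$(printf '%s\n' 'Sca_ffold_Y 345_67 34890' |
    sed -n 'h;s/^\([^ ]*\).*/\1/;s/_//g;G;l')
printf '%s\n' "$pattern_space"
```

This should print the same string as in the diagram: the cleaned first column, \n, then the untouched original line followed by $.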

So for an input file that looks like

1 34_566 34765
2 45678 45_789
Sca_ffold_3 345_67 34799
Sca_ffold_X 678_95 66900
Sca_ffold_Y 345_67 34890

we get output like this (with the command compressed to a single line):

$ sed 'h;s/^\([^ ]*\).*/\1/;s/_//g;G;s/^\(.*\)\n[^ ]*\(.*\)/\1\2/' infile 
1 34_566 34765
2 45678 45_789
Scaffold3 345_67 34799
ScaffoldX 678_95 66900
ScaffoldY 345_67 34890

Remark

Note that these commands assume space-separated input. For other separators, such as tabs, the spaces in the bracket expressions have to be changed, for example to [:blank:], which matches both spaces and tabs. The first solution becomes

sed 's/^\([^[:blank:]_]*\)_/\1/' infile

the second one

sed ':a;s/^\([^[:blank:]_]*\)_/\1/;ta' infile

and the third one

sed 'h;s/^\([^[:blank:]]*\).*/\1/;s/_//g;G;s/^\(.*\)\n[^[:blank:]]*\(.*\)/\1\2/' infile 

Upvotes: 2

mjm

Reputation: 3

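Use the underscore as your field separator (-F) instead of the default whitespace. Note that this only gives the desired result if each line contains at most one underscore and that underscore is in the first column: an underscore in any other column is removed as well, and everything after a second underscore on the line is dropped.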

awk -F'_' '{print $1$2}' file.txt

Upvotes: 0

F. Knorr

Reputation: 3065

This awk one-liner should do the job:

awk '{gsub(/_/,"",$1)}1' input.txt

Output:

1 34566 34765
2 45678 45789
Scaffold3 34567 34799
ScaffoldX 67895 66900
ScaffoldY 34567 34890
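One thing to be aware of: assigning to $1 makes awk rebuild the whole line with single spaces between the fields, so tabs or runs of spaces in the input are not preserved. If that matters, a possible variant (a sketch, not part of the original one-liner; strip_first_col is just an illustrative name) edits only the leading run of non-blank characters and leaves the rest of the line as it is:

```shell
# Strip underscores from the first column only, without assigning to $1,
# so the original separators (tabs, multiple spaces) survive.
# Assumption: lines do not start with blanks; if one does, the "first
# column" seen here is empty and the line is printed unchanged.
strip_first_col() {
    awk '{
        match($0, /^[^ \t]*/)            # leading run of non-blank characters
        first = substr($0, 1, RLENGTH)   # the first column
        rest  = substr($0, RLENGTH + 1)  # separators and remaining columns
        gsub(/_/, "", first)             # remove every underscore in column 1
        print first rest
    }' "$@"
}

printf 'Sca_ffold_3\t345_67  34799\n' | strip_first_col
```

Here the tab and the double space in the sample line are kept, and only the underscores in Sca_ffold_3 are removed.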

Upvotes: 7
