Reputation: 23
I have a large text file with lines like:
01 81118 9164.47 0/0:6,0:6:18:.:.:0,18,172:. 0/0:2,0:2:6:.:.:0,6,74:. 0/1:4,5:9:81:.:.:148,0,81:.
What I need is to keep just the first three characters of all the columns containing a colon, i.e.
01 81118 9164.47 0/0 0/0 0/1
Where the number of chars after the first 3 can vary. I started here by removing everything after a colon, but that removes the entire rest of the line, rather than per word:
sed 's/:.*//g' file.txt
Alternately, I've been trying to bring in the word boundary (\b) and hack away at removing everything after colons several times:
sed 's/\b:[^ ]//g' file.txt | sed 's/\b:[^ ]//g'
But this is not a good way to go about it. What's the best approach?
Upvotes: 0
Views: 610
Reputation: 2865
echo "${input}" | mawk 'BEGIN { __ = OFS = "\f\r\t"
    FS = "^"(_ = "[ \t]*")"|(:"(_)")?"(_)
    _ = sub("[(]..", "&^", FS) } $_ = __$_'
Use _ = "[[:space:]]*" if you want the formal POSIX regex class.
01
81118
9164.47
0/0
0/0
0/1
Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk.
The ultra brute-force method would be:
mawk NF=NF FS='(:[^ \t]*)?[ \t]*' OFS='\t'
01 81118 9164.47 0/0 0/0 0/1
To handle leading/trailing spaces and tabs in the brute-force approach:
gawk NF=NF FS='(:[^ \t]*)?[ \t]*' OFS='\t' | column -t
01 81118 9164.47 0/0 0/0 0/1
Upvotes: 0
Reputation: 58483
This might work for you (GNU sed):
sed -E 's/\S*:/\n&/g;s/\n(\S{3})\S*/\1/g;s/\n//g' file
Prepend a newline to every non-whitespace string that contains a :.
If such a string has at least 3 non-whitespace characters, remove all but the first 3 characters.
Finally, remove the newline markers left on any strings containing a : that were shorter than 3 non-whitespace characters.
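As a small illustration of those three steps, here is a sketch that wraps the command in a function (the name first3_gnu is made up for this example) and feeds it a few edge cases. It assumes GNU sed, since \S and \n in the replacement are GNU features. Note that the colon itself counts as one of the three characters, so a field like a:bcd keeps a:b, and a field too short for step two (z:) passes through unchanged once the marker is removed.

```shell
# Hypothetical wrapper around the command above; requires GNU sed.
first3_gnu() {
  sed -E 's/\S*:/\n&/g;s/\n(\S{3})\S*/\1/g;s/\n//g'
}

# x    : no colon, left alone
# a:bcd: colon within the first 3 characters, so "a:b" is kept
# 0/1:…: the usual case, trimmed to "0/1"
# z:   : fewer than 3 characters, only the newline marker is removed
printf 'x a:bcd 0/1:4,5 z:\n' | first3_gnu
# prints: x a:b 0/1 z:
```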
Upvotes: 1
Reputation: 189679
If, as in your example, the colon is not part of the string which should be preserved, try
sed 's/\(\(^\| \)[^ :][^ :][^ :]\)[^ :]*:[^ ]*/\1/g' file
The literal spaces in the character classes may need to be augmented with tabs and possibly other whitespace characters.
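(The regex could be prettier if your sed supports extended regex with -E or -r or some such option. This ugly sucker avoids those options, though note that the \| alternation is not part of POSIX basic regular expressions, so a few sed implementations may not accept it.)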
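For comparison, a sketch of the same idea written with -E (accepted by both GNU and BSD sed) and with [[:space:]] in place of the literal spaces, so tabs are covered too. The function name keep_first3 is invented for this example, and the sample input is shortened from the question.

```shell
# Hypothetical helper: ERE version of the command above, reading stdin.
# \1 restores the leading separator (or nothing at line start),
# \2 is the three kept characters.
keep_first3() {
  sed -E 's/(^|[[:space:]])([^[:space:]:]{3})[^[:space:]:]*:[^[:space:]]*/\1\2/g'
}

# Mixed blank and tab separators; the separators are preserved as they are.
printf '01 81118 9164.47\t0/0:6,0:6\t0/1:4,5:9\n' | keep_first3
```

As in the original command, a field with more than three characters before its first colon (say abcde:1) is cut down to its first three characters.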
Upvotes: 2
Reputation: 10133
Using GNU sed with regular expression extensions, a one-liner could be:
sed -E 's/(\S{3})\S*:\S*/\1/g' file
\S matches non-whitespace characters (a GNU extension).
Upvotes: 1
Reputation: 5975
Using awk: print only the first 3 characters of any field containing a colon, and print the rest as is.
awk '{ for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file
substr() is one of the standard awk string functions (it is not specific to GNU awk).
The 1 at the end of the program is equivalent to the action {print}, which prints the whole line.
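To make the role of the trailing 1 concrete, the same program can be spelled out with an explicit print. This is only a sketch; the function name trim_colon_fields is made up for the example and it reads from standard input instead of a file.

```shell
# Hypothetical helper: same logic as the one-liner, with an explicit print.
trim_colon_fields() {
  awk '{
    for (i = 1; i <= NF; i++)      # visit every whitespace-separated field
      if ($i ~ /:/)                # the field contains a colon
        $i = substr($i, 1, 3)      # keep only its first three characters
    print                          # what the trailing 1 does implicitly
  }'
}

printf '01 81118 9164.47 0/0:6,0:6 0/1:4,5:9\n' | trim_colon_fields
# prints: 01 81118 9164.47 0/0 0/1
```

Assigning to a field makes awk rebuild the line using OFS (a single space by default), so on any line where a field is changed, the original separators are replaced by single spaces.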
Regarding output format, if input is tab separated and you want to keep the tabs, you can run:
awk 'BEGIN{OFS=FS="\t"} { for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file
Another idea is to pretty-print with column -t (this does not insert real \t characters, but an appropriate number of spaces between fields):
awk '{ for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file |column -t
Upvotes: 2
Reputation: 204015
Using a sed that has a -E arg to enable EREs (e.g. GNU or BSD/OSX sed):
$ sed -E 's/([^[:space:]]{3}):[^[:space:]]+/\1/g' file
01 81118 9164.47 0/0 0/0 0/1
With a POSIX sed:
$ sed 's/\([^[:space:]]\{3\}\):[^[:space:]]\{1,\}/\1/g' file
01 81118 9164.47 0/0 0/0 0/1
The above will work regardless of whether the spaces in your input are blanks or tabs or both.
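A quick way to check that claim is to run the POSIX version on a line that mixes blanks and tabs. This is a sketch with a made-up function name (keep_prefix) and a shortened, invented input line.

```shell
# Hypothetical helper wrapping the POSIX sed command above; reads stdin.
keep_prefix() {
  sed 's/\([^[:space:]]\{3\}\):[^[:space:]]\{1,\}/\1/g'
}

# Fields separated by a blank, then tabs, then a blank again.
printf '01 81118\t0/0:6,0:6\t0/1:4,5:9 1/1:0,3\n' | keep_prefix
```

Because the substitution only touches the non-whitespace run starting at the colon, the separators in the output are exactly the ones from the input.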
Upvotes: 2