noam42

Reputation: 23

Keep first 3 characters of every word containing a character

I have a large text file with lines like:

01    81118   9164.47    0/0:6,0:6:18:.:.:0,18,172:.   0/0:2,0:2:6:.:.:0,6,74:.  0/1:4,5:9:81:.:.:148,0,81:.

What I need is to keep just the first three characters of all the columns containing a colon, i.e.

01  81118   9164.47  0/0  0/0  0/1

The number of characters after the first three can vary. I started by removing everything after a colon, but that removes the entire rest of the line rather than working per word:
sed 's/:.*//g' file.txt

Alternately, I've been trying to bring in the word boundary (\b) and hack away at removing everything after colons several times:

sed 's/\b:[^ ]//g' file.txt | sed 's/\b:[^ ]//g'

But this is not a good way to go about it. What's the best approach?

Upvotes: 0

Views: 610

Answers (6)

RARE Kpop Manifesto

Reputation: 2865

  • Optional: set _ = "[[:space:]]*" if you want to use the formal POSIX character class.
echo "${input}" |
mawk 'BEGIN { __ = OFS ="\f\r\t"
              FS = "^"(_ = "[ \t]*")"|(:"(_)")?"(_)
               _ = sub("[(]..", "&^", FS) } $_ = __$_'
01
81118
9164.47
0/0
0/0
0/1

Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk.

The ultra brute-force method would be something like:

mawk NF=NF FS='(:[^ \t]*)?[ \t]*' OFS='\t'
01    81118    9164.47    0/0    0/0    0/1

To handle leading/trailing spaces and tabs in the brute-force approach:

gawk NF=NF FS='(:[^ \t]*)?[ \t]*' OFS='\t' | column -t 
01  81118  9164.47  0/0  0/0  0/1

Upvotes: 0

potong

Reputation: 58483

This might work for you (GNU sed):

sed -E 's/\S*:/\n&/g;s/\n(\S{3})\S*/\1/g;s/\n//g' file

Prepend a newline to any non-whitespace string which contains a :.

If such a string contains at least 3 non-whitespace characters, remove all but the first 3 characters.

Clean up the newline markers left on any strings with :'s that were shorter than 3 non-whitespace characters.
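
The three steps above can be tried on a short line. This is only a sketch: the sample line is made up for illustration (it is not from the question), and it assumes GNU sed, since \S and \n in the replacement are GNU extensions.

```shell
# Made-up sample: one plain column, then two colon-bearing columns.
line='01 0/0:6,0:6 0/1:4,5:9'

# Step 1: prepend a newline marker to each word that contains a colon.
# Step 2: after each marker, keep only the first 3 non-whitespace characters.
# Step 3: delete any markers that step 2 did not consume.
result=$(printf '%s\n' "$line" |
  sed -E 's/\S*:/\n&/g;s/\n(\S{3})\S*/\1/g;s/\n//g')

printf '%s\n' "$result"
```

The newline works as a marker because it can never occur inside a line that sed has just read, so it cannot collide with real input.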

Upvotes: 1

tripleee

Reputation: 189679

If, as in your example, the colon is not part of the string which should be preserved, try

sed 's/\(\(^\| \)[^ :][^ :][^ :]\)[^ :]*:[^ ]*/\1/g' file

The literal spaces in the character classes may need to be augmented with tabs and possibly other whitespace characters.
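
As a rough sketch of that adjustment, each literal space can be swapped for the POSIX [:space:] class. The tab-separated sample line is invented for illustration, and the \| alternation is kept from the expression above on the assumption that the sed in use accepts it (GNU sed does).

```shell
# Made-up tab-separated sample (not from the question).
line=$(printf '01\t81118\t0/0:6,0:6\t0/1:4,5:9')

# Same expression as above, with every literal space replaced by
# [:space:] so that tabs also count as column separators.
result=$(printf '%s\n' "$line" |
  sed 's/\(\(^\|[[:space:]]\)[^[:space:]:][^[:space:]:][^[:space:]:]\)[^[:space:]:]*:[^[:space:]]*/\1/g')

printf '%s\n' "$result"
```

The separator before each trimmed column is part of the captured group, so the original tabs are kept in the output.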

(The regex could be prettier if your sed supports extended regex with -E or -r or some such nonstandard option; but this ugly sucker should be portable most anywhere.)

Upvotes: 2

M. Nejat Aydin

Reputation: 10133

Using GNU sed with regular expression extensions, a one-liner could be:

sed -E 's/(\S{3})\S*:\S*/\1/g' file

\S matches non-whitespace characters (a GNU extension).

Upvotes: 1

thanasisp

Reputation: 5975

Using awk: print only the first 3 characters of any field containing a colon, and print the rest as is.

awk '{ for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file
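  • substr() is one of the standard (POSIX) awk string functions, so it is not specific to GNU awk.

  • The 1 at the end is an always-true pattern with no action, which is equivalent to { print }: it prints the whole (possibly modified) line.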
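
One side effect worth seeing: assigning to a field makes awk rebuild the line with OFS (a single space by default), so runs of spaces collapse on any line where a field was changed. A minimal sketch, using a made-up sample line:

```shell
# Made-up sample with irregular spacing between columns.
line='01    81118   0/0:6,0:6'

# Trim colon-bearing fields to 3 characters; the assignment to $i
# causes awk to rebuild $0 with OFS (a single space by default).
result=$(printf '%s\n' "$line" |
  awk '{ for (i=1;i<=NF;i++) if ($i ~ /:/) $i=substr($i,1,3) } 1')

printf '%s\n' "$result"
```

Lines with no colon field are never assigned to, so they are printed with their original spacing.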


Regarding output format, if input is tab separated and you want to keep the tabs, you can run:

awk 'BEGIN{OFS=FS="\t"} { for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file

Another idea is to pretty-print with column -t (it does not insert real tabs, but pads with an appropriate number of spaces between fields):

awk '{ for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file |column -t

Upvotes: 2

Ed Morton

Reputation: 204015

Using a sed that has a -E option to enable EREs (e.g. GNU or BSD/OSX sed):

$ sed -E 's/([^[:space:]]{3}):[^[:space:]]+/\1/g' file
01    81118   9164.47    0/0   0/0  0/1

With a POSIX sed:

$ sed 's/\([^[:space:]]\{3\}\):[^[:space:]]\{1,\}/\1/g' file
01    81118   9164.47    0/0   0/0  0/1

The above will work regardless of whether the spaces in your input are blanks or tabs or both.
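
A quick way to check that claim is to feed the POSIX version a line that mixes tabs and blanks. The sample below is made up, and, as in the question's data, the colon directly follows the third character of each column to be trimmed.

```shell
# Made-up sample mixing tabs and blanks between columns.
line=$(printf '01\t81118  0/0:6,0:6 \t0/1:4,5:9')

# POSIX BRE: keep 3 non-space characters, drop the colon and the
# rest of that column; the separators themselves are never touched.
result=$(printf '%s\n' "$line" |
  sed 's/\([^[:space:]]\{3\}\):[^[:space:]]\{1,\}/\1/g')

printf '%s\n' "$result"
```

Because the expression only matches inside a column, the original mix of tabs and blanks is preserved in the output.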

Upvotes: 2
