Rajesh M

Reputation: 31

Bash script: filter columns based on a character

My text file should have two columns separated by a tab (shown as \t below). However, a few rows are corrupted: column 1 holds two values separated by a space (shown as \s).

A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5

My objective is to create a table as follows:


A\t1
B\t2
C\t3
D\t4
E\t5

That is, I want to discard the second value that follows the space in column 1. For example, in C\sx\t3 I would drop the x after the space and store the row as C\t3.

I have tried a couple of things but with no luck.

I tried to cut the file into separate columns on \t, then cut the first column on the space and join the two again. However, it did not work. Here is the snippet:

col1=$(cut -d$'\t' -f1 "$file" | cut -d' ' -f1)
col2=$(cut -d$'\t' -f2 "$file")
myArr=()
for((idx=0;idx<${#col1[@]};idx++));do
  echo "${col1[$idx]} ${col2[$idx]}"
  # I will append to myArr here
done

The output appends the whole of col2 after the whole of col1, as in A B C D E 1 2 3 4 5. On top of this, my file is very large (5,300,000 rows), so I would like to avoid looping over the records and appending them one by one.

Any advice is very much appreciated.

Thank you. :)

Upvotes: 1

Views: 374

Answers (5)

RARE Kpop Manifesto

Reputation: 2865

Take advantage of FS and OFS and let them do all the hard work for you:

{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'

A   1
B   2
C   3
D   4
E   5

If there's a chance of leading or trailing spaces and tabs, then perhaps:

mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'
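
For readers who find the first one-liner terse, here is a more verbose sketch of the same FS/OFS idea. It assumes the two-column layout from the question and that column 2 never contains a space; the sample input and the variable names `input` and `result` are illustrative only.

```shell
# Sample rows from the question; printf turns \t into real tabs.
input=$(printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5')

# FS: a space or tab, then anything, then a space or tab.
# On a corrupted row this swallows " x<TAB>" as the separator.
# A clean row has a single tab, so FS never matches and the row is one field.
# Assigning $1 to itself makes awk rebuild the record using OFS.
result=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = "[ \t].*[ \t]"; OFS = "\t" } { $1 = $1; print }')

printf '%s\n' "$result"
```

Clean rows pass through untouched because they are never split, so the rebuild has nothing to rejoin.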

Upvotes: 1

tink

Reputation: 15229

And another sed solution:

Search for a literal space followed by one or more non-tab characters and replace the match with nothing.

sed -E 's/ [^\t]+//' file
A       1
B       2
C       3
D       4
E       5

If there could be more than one actual space in there just make it 's/ +[^\t]+//' ...
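
One caveat: reading `\t` as a tab inside a bracket expression is, as far as I know, a GNU sed extension, and other sed implementations may treat `[^\t]` as "neither a backslash nor the letter t". A minimal sketch that sidesteps this by passing a literal tab to sed is shown below; the variable names `tab`, `input` and `result` are illustrative, not part of the answer.

```shell
# Hold a literal tab in a variable so the sed script needs no \t escape.
tab=$(printf '\t')

# Sample rows from the question.
input=$(printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5')

# Delete the first space and everything after it up to, but not including,
# the tab. Using * rather than + also covers a space that sits directly
# before the tab.
result=$(printf '%s\n' "$input" | sed "s/ [^$tab]*//")

printf '%s\n' "$result"
```

Like the original command, this assumes column 2 itself never contains a space.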

Upvotes: 1

pjh

Reputation: 8174

Try

sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
  • The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.

Upvotes: 1

Marek Knappe

Reputation: 442

A solution using Perl regular expressions (for me they are easier than sed's, and more portable, since there are several versions of sed):

$ cat ls
A   1
B   2
C x 3
D   4
E y 5

$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5

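This code captures the run of non-whitespace characters at the start of the line and the run of non-whitespace characters after the \t. Note that the replacement joins them with a space; write $1\t$2 instead to keep a tab between the columns.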

Upvotes: 1

Ed Morton

Reputation: 203995

Assuming that when you say a space you mean a blank character, then using any awk:

awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
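
Spelled out with comments, the same approach might look like the sketch below. It reads the input once and writes the cleaned rows to a second file, which should suit a file of several million rows. The temporary files and the names `infile`, `outfile` and `result` are stand-ins for illustration.

```shell
# Illustrative input file; in practice this would be the real data file.
infile=$(mktemp)
outfile=$(mktemp)
printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5\n' > "$infile"

awk '
  BEGIN { FS = OFS = "\t" }   # split and rejoin on tabs only
  {
    sub(/ .*/, "", $1)        # drop the first space in column 1 and all after it
    print                     # rows without a space are printed unchanged
  }
' "$infile" > "$outfile"

result=$(cat "$outfile")
printf '%s\n' "$result"
rm -f "$infile" "$outfile"
```

Because the substitution is applied to $1 only, a space inside column 2 would be left alone.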

Upvotes: 1
