Reputation: 31
My text file should have two columns separated by a tab (represented by \t) as shown below. However, there are a few corrupted values where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the second value that follows the space in column 1. For example, in C\sx\t3 I want to discard the x that comes after the space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the columns on \t into independent columns, then cut the first column on \s and join them again. However, it did not work.
Here is the snippet:
col1=( $(cut -d$'\t' -f1 "$file" | cut -d' ' -f1) )   # first word of column 1
col2=( $(cut -d$'\t' -f2 "$file") )                    # column 2
myArr=()
for ((idx=0; idx<${#col1[@]}; idx++)); do
    echo "${col1[$idx]} ${col2[$idx]}"
    # I will append to myArr here
done
The output I got appended the list of col2 to col1, as A B C D E 1 2 3 4 5. On top of this, my file is very large (5,300,000 rows), so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
Upvotes: 1
Views: 374
Reputation: 2865
Take advantage of FS and OFS and let them do all the hard work for you:
mawk 'NF=NF' FS='[ \t].*[ \t]' OFS='\t' file   # gawk works too
A 1
B 2
C 3
D 4
E 5
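To see what the separator regex is doing, here is a small self-contained run. It is only a sketch: the sample data is taken from the question, the names `sample` and `cleaned` are illustrative, and `$1 = $1` is swapped in for `NF=NF` because it forces the same record rebuild in any POSIX awk. Lines with a single tab contain no match for `[ \t].*[ \t]`, so they stay as one field and pass through unchanged; on the corrupted lines the regex swallows everything from the first space through the last tab.

```shell
# Build the sample input from the question in a temporary file.
sample=$(mktemp)
printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5\n' > "$sample"

# FS matches from the first blank to the last blank of a line, which
# requires at least two blanks. `$1 = $1` makes awk rebuild the record
# with OFS between the remaining fields.
cleaned=$(awk '{ $1 = $1 } 1' FS='[ \t].*[ \t]' OFS='\t' "$sample")

printf '%s\n' "$cleaned"
rm -f "$sample"
```

Because the separator is greedy, a line with more than one extra value in column 1 (say `C x z\t3`) would also collapse to `C\t3`.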
If there's a chance of leading or trailing spaces and tabs, then perhaps:
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'
Upvotes: 1
Reputation: 15229
And another sed solution:
Search for a literal space followed by one or more non-tab characters and replace it with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there just make it 's/ +[^\t]+//'
...
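One caveat worth hedging: reading `\t` inside a bracket expression as a tab is GNU sed behaviour; under POSIX rules `[^\t]` means "neither a backslash nor the letter t". A minimal sketch that avoids relying on it by splicing in a literal tab (the variable names `tab`, `sample` and `cleaned` are illustrative, and the sample data is the one from the question):

```shell
# Capture a literal tab character; this works in any POSIX shell.
tab=$(printf '\t')

sample=$(mktemp)
printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5\n' > "$sample"

# Same substitution as above, with the tab passed in through "$tab":
# delete a space plus the run of non-tab characters that follows it.
cleaned=$(sed -E "s/ [^$tab]+//" "$sample")

printf '%s\n' "$cleaned"
rm -f "$sample"
```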
Upvotes: 1
Reputation: 8174
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C quoting ($'...') feature of Bash is used to make tab characters visible as \t.
Upvotes: 1
Reputation: 442
Solution using Perl regular expressions (for me they are easier than sed's, and more portable, since there are several versions of sed):
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls | perl -pe 's/^(\S+).*\t(\S+)/$1\t$2/'
A 1
B 2
C 3
D 4
E 5
This code captures the run of non-whitespace characters at the start of the line and the run of non-whitespace characters after the \t, and joins them with a tab.
Upvotes: 1
Reputation: 203995
Assuming that when you say "a space" you mean a blank character, then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
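Since awk streams the file one line at a time, no shell loop or array is needed even for millions of rows. For a file that size it may be worth sanity-checking the result; the following is only a sketch, with hypothetical temporary files `src` and `dst` standing in for the real input and output:

```shell
src=$(mktemp)
dst=$(mktemp)
printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5\n' > "$src"

# Split and join on tabs only; strip everything from the first space
# onward in column 1, then print every line (the trailing 1).
awk 'BEGIN { FS = OFS = "\t" } { sub(/ .*/, "", $1) } 1' "$src" > "$dst"

# Sanity checks: the row count is unchanged and no space is left in column 1.
rows_in=$(wc -l < "$src")
rows_out=$(wc -l < "$dst")
bad=$(awk -F'\t' '$1 ~ / / { n++ } END { print n + 0 }' "$dst")
echo "rows: $rows_in -> $rows_out, rows with a space in column 1: $bad"

rm -f "$src" "$dst"
```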
Upvotes: 1