Reputation: 13
I have a large tab-delimited text file which contains 22 columns and up to 10^6 lines. Column 7 of the file is an 11-character string which I need to edit as follows: the last 5 characters (characters 7-11) need to become the first 5 characters.
For example, current file looks like:
col1a col2a col3a col4a col5a col6a XXXXXXAAAAA col8a ...
col1b col2b col3b col4b col5b col6b XXXXXXBBBBB col8b ...
col1c col2c col3c col4c col5c col6c XXXXXXCCCCC col8c ...
col1d col2d col3d col4d col5d col6d XXXXXXDDDDD col8d ...
....
The desired output is:
col1a col2a col3a col4a col5a col6a AAAAAXXXXXX col8a ...
col1b col2b col3b col4b col5b col6b BBBBBXXXXXX col8b ...
col1c col2c col3c col4c col5c col6c CCCCCXXXXXX col8c ...
col1d col2d col3d col4d col5d col6d DDDDDXXXXXX col8d ...
....
It seems to me that one way of doing this is to cut the relevant column in two using cut, then combine the halves again, perhaps using paste? So far I have only managed to do this in multiple steps (the original file name is short):
1) Using awk and cut to create two new files, one for each half of the column:
awk ' BEGIN { FS="\t"; OFS="\t" } {print $7} ' short | cut -c1-6 > file1
awk ' BEGIN { FS="\t"; OFS="\t" } {print $7} ' short | cut -c7-11 > file2
2) Using paste to paste them back together:
paste -d "" file2 file1 > file12
3) Using paste to append the new file to the original as an extra column:
paste -d"\t" short file12 > shortCom
4) Using 'awk' to replace original column 7 with new one:
awk ' BEGIN { FS="\t"; OFS="\t" } {
$7 = $23
print $0 } ' shortCom
This is obviously a very long and cumbersome process to do something that I suspect is actually quite simple... I would be very grateful for any advice you may have on improving this, in order to make this quicker and more efficient.
Thanks!!
Upvotes: 1
Views: 2159
Reputation: 77105
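This should work (the BEGIN block keeps the tab delimiter on both input and output; the first 6 characters of column 7 are moved behind the last 5):
awk 'BEGIN{FS=OFS="\t"} {y=substr($7,1,6); z=substr($7,7); $7=z y}1' inputfile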
If you have GNU awk, then:
gawk 'BEGIN{FS=OFS="\t"} {$7=gensub(/^(.{6})(.{5})$/, "\\2\\1", "g", $7)}1' inputfile
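Before running this on the full file, it may be worth checking the result on a couple of rows. Below is a minimal sketch using the portable substr approach; the file names sample.tsv and sample.swapped.tsv and the function name swap_col7 are made up for illustration, and the sample rows have only 8 columns rather than 22.

```shell
# Build a two-line, tab-delimited sample (hypothetical file name).
printf 'c1\tc2\tc3\tc4\tc5\tc6\tXXXXXXAAAAA\tc8\n' >  sample.tsv
printf 'c1\tc2\tc3\tc4\tc5\tc6\tXXXXXXBBBBB\tc8\n' >> sample.tsv

# Hypothetical helper: move the last 5 characters of column 7 to the front.
# FS and OFS are both set to a tab so the output stays tab-delimited.
swap_col7() {
    awk 'BEGIN { FS = OFS = "\t" }
         { $7 = substr($7, 7) substr($7, 1, 6) }
         1' "$1"
}

# Write to a new file; awk does not edit the input in place.
swap_col7 sample.tsv > sample.swapped.tsv
cat sample.swapped.tsv
```

If the output looks right, the same function can be pointed at the real file and redirected to a new file, which can then replace the original.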
Upvotes: 1