Reputation: 165
I have a file such as:
scaffold_10_1 YP_02917613.1 0.722 397 90 1 55021 53805 70 446 1.803E-180 566
scaffold_282_0 YP_004091438.1 0.799 317 102 1 55023 53395 66 442 2.282E-173 546
scaffold_15 YP_009676312.1 0.021 327 14 1 55320 52895
IDBA_scaffold_66230_1 YP_004091438.1 0.789 317 122 1 55023 53395 66 442 2.282E-173 506
scf7180005161552_2 YP_004091438.1 0.789 317 122 1 55023 53395 66 442 2.282E-173 506
And the idea is simply to remove the last number part of all names in the first column and get:
scaffold_10 YP_02917613.1 0.722 397 90 1 55021 53805 70 446 1.803E-180 566
scaffold_282 YP_004091438.1 0.799 317 102 1 55023 53395 66 442 2.282E-173 546
scaffold_15 YP_009676312.1 0.021 327 14 1 55320 52895
IDBA_scaffold_66230 YP_004091438.1 0.789 317 122 1 55023 53395 66 442 2.282E-173 506
scf7180005161552 YP_004091438.1 0.789 317 122 1 55023 53395 66 442 2.282E-173 506
As you can see, sometimes there is no second _number after the first _number, as in scaffold_15.
Do you have an idea of how to deal with that?
Thank you for your help.
For brunorey: here is the table I got:
scaffold_10 YP_02917613.1 0.722 397 90 1 55021 53805 70 446 1.803E-180 566
scaffold_282 YP_004091438.1 0.799 317 102 1 55023 53395 66 442 2.282E-173 546
scaffold YP_009676312.1 0.021 327 14 1 55320 52895
IDBAscaffold_66230_1 YP_004091438.1 0.789 317 122 1 55023 53395 66 442 2.282E-173 506
scf7180005161552 YP_004091438.1 0.789 317 122 1 55023 53395 66 442 2.282E-173 506
As you can see, the 15 of scaffold_15 has been removed, but I want to keep it.
Upvotes: 0
Views: 47
Reputation: 2491
You can try this sed command:
sed 's/\(^[^_]*_[^_]*\)\(_[0-9]\{1,\}\)\([[:blank:]]\{1,\}.*\)/\1\3/' infile
With data like IDBA_scaffold_66230_1, you can try this awk command:
awk 'BEGIN{FS=OFS="\t"}$1~/.*_[0-9]+_[0-9]+$/{sub(/_[0-9]+$/,"",$1)}1' infile
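To make the awk condition concrete, here is a small sketch that runs it on tab-separated copies of the sample names (the awk sets FS to a tab, so it expects real tabs between columns). The function names strip_double and strip_after_digit, and the dummy columns, are made up for illustration. The second function is a variant built on an assumption read off the expected output, not stated in the question: a trailing _number is dropped only when the character just before it is a digit.

```shell
# The awk one-liner from above, wrapped in a function (hypothetical name).
# It strips the last _number only when the name ends in _number_number.
strip_double() {
  awk 'BEGIN{FS=OFS="\t"} $1~/.*_[0-9]+_[0-9]+$/ {sub(/_[0-9]+$/,"",$1)} 1' "$@"
}

# Variant (assumption): strip the last _number whenever a digit precedes it.
strip_after_digit() {
  awk 'BEGIN{FS=OFS="\t"} $1~/[0-9]_[0-9]+$/ {sub(/_[0-9]+$/,"",$1)} 1' "$@"
}

# printf reuses the format for each argument, giving one line per name.
printf '%s\tYP_x\t0.7\n' scaffold_10_1 scaffold_15 \
  IDBA_scaffold_66230_1 scf7180005161552_2 | strip_after_digit
```

Both functions leave scaffold_15 alone. strip_double also leaves scf7180005161552_2 unchanged, since that name contains only one _number group; the variant shortens it to scf7180005161552.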
Upvotes: 1
Reputation: 2255
Try
cat file.csv | sed -e 's/\([A-Ba-b0-9_]*\)\(_[0-9]*\)\(.*\)/\1\3/' > file-without-number.csv
How does this work?
sed is the stream editor, and s/ will search and replace. The syntax is slash-separated: s/search_pattern/replace_pattern/.
The search pattern \([A-Ba-b0-9_]*\)\(_[0-9]*\)\(.*\) splits the line into 3 parts:
any string composed of letters, numbers or _ (matching \([A-Ba-b0-9_]*\))
_number (matching \(_[0-9]*\))
the rest of the line (matching \(.*\))
The replacement \1\3 will replace the string with only parts 1 and 3, thus removing part 2.
Upvotes: 2
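The table in the question shows the catch with this pattern. It is not anchored, and for scaffold_15 the only way to satisfy \(_[0-9]*\) is to hand it _15, so the 15 is removed. Also, [A-Ba-b0-9_] as typed covers only the letters A, B, a and b (presumably [A-Za-z0-9_] was meant), which is why IDBA_scaffold_66230_1 came out as IDBAscaffold_66230_1. Below is a minimal sketch of an anchored alternative. It assumes, based on the expected output rather than anything stated in the question, that the trailing _number should go only when the character before it is a digit. The function name strip_first_column and the dummy columns are hypothetical.

```shell
# Hypothetical helper: edit the first column only (anchored with ^),
# and drop its trailing _number only when a digit precedes it (assumption).
# \1 keeps the name up to that digit, \2 keeps the separator after the name.
strip_first_column() {
  sed 's/^\([^[:blank:]]*[0-9]\)_[0-9]\{1,\}\([[:blank:]]\)/\1\2/' "$@"
}

# printf reuses the format for each argument, giving one line per name.
printf '%s YP_004091438.1 0.799\n' scaffold_10_1 scaffold_15 \
  IDBA_scaffold_66230_1 scf7180005161552_2 | strip_first_column
```

Because [^[:blank:]]* cannot cross whitespace, underscores in later columns such as YP_004091438.1 are never touched. The pattern requires a space or tab after the name, so a line consisting of a name alone would be left as it is.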