bewolf

Reputation: 165

Remove a specific pattern in a tab file

I have a file such as:

scaffold_10_1   YP_02917613.1   0.722   397 90  1   55021   53805   70  446 1.803E-180  566
scaffold_282_0  YP_004091438.1  0.799   317 102 1   55023   53395   66  442 2.282E-173  546
scaffold_15     YP_009676312.1  0.021   327 14  1   55320   52895
IDBA_scaffold_66230_1   YP_004091438.1         0.789    317 122 1   55023   53395   66  442 2.282E-173  506
scf7180005161552_2      YP_004091438.1          0.789   317 122 1   55023   53395   66  442 2.282E-173  506

The idea is simply to remove the trailing _number from the names in the first column, to get:

scaffold_10 YP_02917613.1   0.722   397 90  1   55021   53805   70  446 1.803E-180  566
scaffold_282    YP_004091438.1  0.799   317 102 1   55023   53395   66  442 2.282E-173  546
scaffold_15     YP_009676312.1  0.021   327 14  1   55320   52895
IDBA_scaffold_66230    YP_004091438.1         0.789 317 122 1   55023   53395   66  442 2.282E-173  506
scf7180005161552     YP_004091438.1            0.789    317 122 1   55023   53395   66  442 2.282E-173  506

As you can see, sometimes there is no second _number after the first _number, as in:

scaffold_15

Do you have an idea how to deal with that?

Thank you for your help.

For brunorey: here is the table I got:

scaffold_10   YP_02917613.1   0.722   397 90  1   55021   53805   70  446 1.803E-180  566
scaffold_282  YP_004091438.1  0.799   317 102 1   55023   53395   66  442 2.282E-173  546
scaffold     YP_009676312.1  0.021   327 14  1   55320   52895
IDBAscaffold_66230_1   YP_004091438.1         0.789    317 122 1   55023   53395   66  442 2.282E-173  506
scf7180005161552      YP_004091438.1          0.789   317 122 1   55023   53395   66  442 2.282E-173  506

As you can see, the 15 of scaffold_15 has been removed, but I want to keep it.

Upvotes: 0

Views: 47

Answers (2)

ctac_

Reputation: 2491

You can try this sed command:

sed 's/\(^[^_]*_[^_]*\)\(_[0-9]\{1,\}\)\([[:blank:]]\{1,\}.*\)/\1\3/' infile

With data like IDBA_scaffold_66230_1, you can try this awk command:

awk 'BEGIN{FS=OFS="\t"}$1~/.*_[0-9]+_[0-9]+$/{sub(/_[0-9]+$/,"",$1)}1' infile
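
The awk command above requires two _number groups at the end of the name, so a name like scf7180005161552_2 is left unchanged. A minimal sketch of a variant that also covers that row, assuming the rule is "strip the trailing _number only when the character just before it is a digit" (a rule inferred from the sample, not stated in the question) and assuming the file is really tab-separated. The function name strip_suffix_awk is hypothetical:

```shell
# Hypothetical helper; assumes tab-separated input and the inferred rule
# "drop the trailing _<digits> of column 1 only if a digit precedes it".
strip_suffix_awk() {
  awk 'BEGIN { FS = OFS = "\t" }
       # scaffold_15 does not match: the character before "_15" is a letter
       $1 ~ /[0-9]_[0-9]+$/ { sub(/_[0-9]+$/, "", $1) }
       1' "$@"
}
```

Usage would be strip_suffix_awk infile > outfile. If the columns are separated by spaces rather than tabs, FS has to be adjusted and the original spacing will not be preserved.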

Upvotes: 1

brunorey

Reputation: 2255

Try

cat file.csv | sed -e 's/\([A-Ba-b0-9_]*\)\(_[0-9]*\)\(.*\)/\1\3/' > file-without-number.csv

How does this work?

  • sed is a stream editor.
  • The s command searches and replaces. Its syntax is slash-separated: s/search_pattern/replace_pattern/.
  • The search pattern is \([A-Ba-b0-9_]*\)\(_[0-9]*\)\(.*\). It splits the line into 3 parts:
    • 1) \([A-Ba-b0-9_]*\): a run of characters from the bracket class. The intent is letters, digits or _; as written, A-B and a-b only cover A, B, a and b, so [A-Za-z0-9_] expresses that intent.
    • 2) ...followed by _number (matching \(_[0-9]*\))
    • 3) the rest of the line (matching \(.*\))
  • The replacement \1\3 keeps only parts 1 and 3, thus removing part 2.
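
Because the pattern is not anchored and [0-9]* can match nothing, it also strips the only numeric suffix of a name like scaffold_15, as reported in the question. A possible sketch of an anchored variant, assuming the rule is "remove the trailing _number of the first column only when the character before it is a digit" (inferred from the expected output, not stated by the asker). The function name strip_last_number is hypothetical, and -E (extended regular expressions) is assumed to be available, as it is in GNU and BSD sed:

```shell
# Hypothetical helper; the rule is an assumption inferred from the sample:
# drop the trailing _<digits> of the first column only if a digit precedes it.
strip_last_number() {
  # ^([^[:blank:]]*[0-9])  first column up to and including a digit (kept)
  # _[0-9]+                the trailing _number (dropped)
  # ([[:blank:]]|$)        the separator or end of line (kept)
  sed -E 's/^([^[:blank:]]*[0-9])_[0-9]+([[:blank:]]|$)/\1\2/' "$@"
}
```

Usage would be strip_last_number file.csv > file-without-number.csv. Only the first column is touched, so the original spacing between columns stays as it was.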

Upvotes: 2
