Jpike

Reputation: 209

How to remove text from a line after "." in awk?

I have found a few answers to similar questions where the author wants to remove text after a certain character in a string (example). I would like to do something similar, but the delimiter I want to use is ".", specifically three consecutive occurrences, i.e. "...". Whenever I use the commands I have found, all of the characters are removed.

Example input file, InFile.txt:

GCA_000260195.2_FO_II5_V1_genomic.fna_Candidate_Sequence_11158-16380_64... *
GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_73046-78268_63... at 100.00%
GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_338336-343558_32... at 100.00%
GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_482256-487478_64... at 100.00%
GWHAASU00000000_FocTR4_58.genomic.fna_Candidate_Sequence_429502-434724_64... at 100.00%

Example command 1: awk -F'...' '{print $1}' InFile.txt

Output of Example command 1 is just blank space.

I have also tried wrapping the characters in double quotes, e.g.

Example command 2: awk -F'"..."' '{print $1}' InFile.txt

Which produces this output:

GCA_000260195.2_FO_II5_V1_genomic.fna_Candidate_Sequence_11158-16380_64... *
GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_73046-78268_63... at 100.00%
GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_338336-343558_32... at 100.00%
GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_482256-487478_64... at 100.00%
GWHAASU00000000_FocTR4_58.genomic.fna_Candidate_Sequence_429502-434724_64... at 100.00%

Ideally, I'd like the output to look like this:

GCA_000260195.2_FO_II5_V1_genomic.fna_Candidate_Sequence_11158-16380_64
GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_73046-78268_63
GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_338336-343558_32
GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_482256-487478_64
GWHAASU00000000_FocTR4_58.genomic.fna_Candidate_Sequence_429502-434724_64

How do I remove text after "..." without replacing all of the text?

Thanks, Jamie

Upvotes: 3

Views: 925

Answers (2)

anubhava

Reputation: 784998

Answer from @RavinderSingh13 is great as it shows how to properly handle regex meta characters in FS.

Here is an alternate awk that doesn't use any regex and hence doesn't need any escaping:

awk '{print substr($0, 1, index($0, "...") - 1)}' file
GCA_000260195.2_FO_II5_V1_genomic.fna_Candidate_Sequence_11158-16380_64
GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_73046-78268_63
GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_338336-343558_32
GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_482256-487478_64
GWHAASU00000000_FocTR4_58.genomic.fna_Candidate_Sequence_429502-434724_64
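
One caveat worth sketching: index() returns 0 when "..." is not found, so substr($0, 1, -1) prints an empty line for such input. Below is a minimal variant that passes those lines through unchanged. It assumes that is the behaviour you want, and the second input line is a hypothetical example, not taken from the question.

```shell
# First line is from the question; second line is a made-up line with no "...".
out=$(printf '%s\n' \
    'GCA_000260195.2_FO_II5_V1_genomic.fna_Candidate_Sequence_11158-16380_64... *' \
    'hypothetical_line_without_marker' |
    awk '{
        pos = index($0, "...")                      # 0 when "..." is absent
        print (pos ? substr($0, 1, pos - 1) : $0)   # keep the whole line if absent
    }')
printf '%s\n' "$out"
```

This should print the first line cut before "..." and the second line untouched.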

Upvotes: 3

RavinderSingh13

Reputation: 133458

Could you please try the following? You need to escape each . so that awk treats it as a literal character rather than as the regex metacharacter that matches any character.

awk -F'\\.\\.\\.' '{print $1}' Input_file

Or, as per Sundeep's comment, use an interval expression (your awk needs to support these; some older implementations do not):

awk -F'\\.{3}' '{print $1}' Input_file

Correcting OP's attempt: you also do not need the inner double quotes in -F'"..."'. With them, the quotes become part of the separator itself. Use only -F'your_delimiter'.
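
To see why the two attempts in the question behave as they do, here is a small sketch that runs all three separators on the first line of the sample input. The variable names (line, unescaped, quoted, escaped) are only for illustration.

```shell
line='GCA_000260195.2_FO_II5_V1_genomic.fna_Candidate_Sequence_11158-16380_64... *'

# Unescaped: each "." matches any character, so the first three characters
# of the line already form a separator and $1 is the empty string before it.
unescaped=$(printf '%s\n' "$line" | awk -F'...' '{print $1}')

# Quoted: the separator now requires literal double quotes, which the line
# does not contain, so nothing is split and $1 is the whole line.
quoted=$(printf '%s\n' "$line" | awk -F'"..."' '{print $1}')

# Escaped: three literal dots, so $1 is everything before "...".
escaped=$(printf '%s\n' "$line" | awk -F'\\.\\.\\.' '{print $1}')

printf 'unescaped=[%s]\nquoted=[%s]\nescaped=[%s]\n' "$unescaped" "$quoted" "$escaped"
```

The first result is empty, the second is the unchanged line, and only the third is the desired prefix.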

Bonus solution: if you don't want to use a field separator, use sub() instead.

awk '{sub(/\.\.\..*/,"")} 1' Input_file
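
As a quick check of the sub() version, here is a sketch on one line from the question plus one made-up line that has no "..." in it. The second line is hypothetical and only there to show that lines without a match are printed unchanged.

```shell
# sub() deletes from the first literal "..." to the end of the line;
# the trailing 1 is an always-true pattern that prints every line.
out=$(printf '%s\n' \
    'GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_73046-78268_63... at 100.00%' \
    'hypothetical_line_without_marker' |
    awk '{sub(/\.\.\..*/, "")} 1')
printf '%s\n' "$out"
```

The single dots inside the name (for example in "365.1" and "1.0") are left alone because the pattern only matches three dots in a row.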

Upvotes: 4
