Alice Everett

Reputation: 375

Delete an exact string from a large file?

I have data in the following form in a file:

    <http://purl.uniprot.org/here>   <http://purl.uniprot.org/here/unipot/purl>
    <http://purl.uniprot.org/uniprot/Q196Y7>        <http://purl.uniprot.org/core/annotation>

I want to remove every "http://purl.uniprot.org/" prefix inside the angle brackets, so that the output is:

    <here>   <here/unipot/purl>
    <uniprot/Q196Y7>        <core/annotation>

I tried to do this with vi's replace command, but it turned out to be quite slow because my file is 1 TB. Is there a more efficient way to do this using Linux tools or Python?

I know I can use sed, but sed finds patterns and deletes them, whereas I want to delete the exact string.

Upvotes: 1

Views: 107

Answers (3)

Aaron Digulla

Reputation: 328760

As Radu Rădeanu said, sed is a good tool for replacing strings in files since it works on streams instead of trying to load the whole file into memory.

But sed uses regular expressions, and in your case (1 TB of input data) this might be too slow. Unix tools can often handle files of arbitrary size and are surprisingly efficient, but an extreme case like this one may still be too much for them.

If you need to optimize the process, here are a few pointers:

  1. Split the huge file into smaller ones. For example, if this is a log file, create a single file per day instead of concatenating everything into one huge file. That way, you can strip the string once in each daily file.

  2. Write a small C program that searches for the exact string (instead of using a regexp). You can then use optimizations like Boyer-Moore to get a huge performance boost. You should also consider using memory-mapped I/O.
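If a C program is more than you want to write, the same idea (search for the literal bytes instead of compiling a regular expression) can be sketched in Python, which the question allows. This is only a minimal sketch, not a tuned implementation: it uses no memory mapping and no hand-written Boyer-Moore, and it assumes the file is line-oriented with reasonably short lines, as in the sample. The function name `strip_exact` and the constant `NEEDLE` are made up for illustration.

```python
import io

# Exact bytes to delete. The trailing slash is included because the desired
# output in the question starts with "<here>", not "</here>".
NEEDLE = b"http://purl.uniprot.org/"


def strip_exact(src, dst, needle=NEEDLE):
    """Copy src to dst line by line, deleting every exact occurrence of needle.

    src and dst are binary file objects. Iterating a binary file yields one
    line at a time, so the whole file is never held in memory.
    """
    for line in src:
        # bytes.replace does a literal substring search; no regex is involved,
        # so "." in the needle only ever matches a real dot.
        dst.write(line.replace(needle, b""))


if __name__ == "__main__":
    # Small in-memory demo using the two sample lines from the question.
    sample = io.BytesIO(
        b"<http://purl.uniprot.org/here>   <http://purl.uniprot.org/here/unipot/purl>\n"
        b"<http://purl.uniprot.org/uniprot/Q196Y7>        <http://purl.uniprot.org/core/annotation>\n"
    )
    out = io.BytesIO()
    strip_exact(sample, out)
    print(out.getvalue().decode(), end="")
```

For the real file you would pass `open(in_path, "rb")` and `open(out_path, "wb")` instead of the `BytesIO` objects (both paths are placeholders). Writing to a second file means you need enough free disk space for the output, but it leaves the original untouched until you have checked the result.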

Upvotes: 1

wuchang

Reputation: 3079

What do you mean by "But it turned out to be quite"? Quite what? For me, vi is a very good tool for this. Run this command:

    :%s/http:\/\/purl\.uniprot\.org\///g

Upvotes: 0

Radu Rădeanu

Reputation: 2732

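This should work from the command line:

    sed -i 's|http://purl\.uniprot\.org/||g' /path/to/filename

The dots are escaped so that they match only literal dots, and `|` is used as the delimiter so the slashes in the URL need no escaping. You can first try it without the -i option to see the output in your console.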

Upvotes: 1
