Reputation: 375
I have data in the following form in a file:
<http://purl.uniprot.org/here> <http://purl.uniprot.org/here/unipot/purl>
<http://purl.uniprot.org/uniprot/Q196Y7> <http://purl.uniprot.org/core/annotation>
I want to remove every occurrence of "http://purl.uniprot.org/" inside the angle brackets, so that the output is:
<here> <here/unipot/purl>
<uniprot/Q196Y7> <core/annotation>
I tried to do this with vi's replace command, but it turned out to be quite slow because my file is 1 TB. Is there a more efficient way to do the same thing using Linux tools or Python?
I know I can use sed, but sed finds patterns and deletes them, whereas I want to delete the exact string.
Upvotes: 1
Views: 107
Reputation: 328760
As Radu Rădeanu said, sed is a good tool for replacing strings in files, since it works on streams instead of trying to load the whole file into memory.
But sed uses regular expressions, and in your case (1 TB of input data) this might be too slow. Unix tools can often handle files of arbitrary size and are surprisingly efficient, but a corner case like this one might be too much for them.
If you need to optimize the process, here are a few pointers:
Split the huge file into smaller ones. For example, if this is a log file, create a single file per day instead of concatenating everything into one huge file. That way, you can strip the string once in each daily file.
Write a small C program that searches for the exact string (instead of using a regexp). You can then use optimizations like Boyer-Moore to get a huge performance boost. You should also consider using memory-mapped I/O.
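The exact-string idea above can also be sketched in Python, which the question mentions, rather than C. This is only a minimal sketch under some assumptions: the function name strip_literal is made up for illustration, the input is line-oriented as in the sample data, and each line fits comfortably in memory. It relies on bytes.replace, which does a literal substring search instead of regular-expression matching, and it streams the input so the whole file is never loaded at once.

```python
import io

NEEDLE = b"http://purl.uniprot.org/"  # exact string taken from the question


def strip_literal(src, dst, needle=NEEDLE):
    """Copy src to dst line by line, removing every exact occurrence of needle.

    src and dst are binary file objects (hypothetical helper, not a library API).
    """
    for line in src:  # binary file objects yield one line at a time
        # bytes.replace treats needle as a literal, so "." matches only "."
        dst.write(line.replace(needle, b""))


# Small in-memory demonstration using the sample lines from the question.
sample = io.BytesIO(
    b"<http://purl.uniprot.org/here> <http://purl.uniprot.org/here/unipot/purl>\n"
    b"<http://purl.uniprot.org/uniprot/Q196Y7> <http://purl.uniprot.org/core/annotation>\n"
)
result = io.BytesIO()
strip_literal(sample, result)
```

For a real file, the same function could be called with open(input_path, "rb") and open(output_path, "wb"), writing to a new file rather than editing in place. Whether this beats sed on 1 TB is something you would have to measure; disk throughput is likely to dominate either way.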
Upvotes: 1
Reputation: 3079
What do you mean by "But it turned out to be quite"? Quite what? In my opinion, vi is a very good tool. Run this command:
:%s/http:\/\/purl\.uniprot\.org\///g
Upvotes: 0
Reputation: 2732
This should work from command-line:
sed -i 's/http:\/\/purl\.uniprot\.org\///g' /path/to/filename
You can try it first without the -i argument to see the output in your console.
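If you would rather do the same kind of dry run from Python, something like the following might work. It is a rough sketch, not a replacement for the sed command: the function name preview and its parameters are invented for illustration, and it only reads the first few lines, so the file itself is never modified.

```python
from itertools import islice

NEEDLE = b"http://purl.uniprot.org/"  # exact string from the question


def preview(path, n=5, needle=NEEDLE):
    """Return the first n lines of the file at path with needle removed.

    Hypothetical helper: the file is opened read-only and left untouched.
    """
    with open(path, "rb") as src:
        # islice stops after n lines, so a 1 TB file is not read in full
        return [line.replace(needle, b"") for line in islice(src, n)]
```

Calling preview("/path/to/filename") returns a list of byte strings that you can print and inspect before running the real replacement.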
Upvotes: 1