Reputation: 19654
I'm trying to remove the first 37 lines from a very, very large file. I started with sed and awk, but they seem to require copying the data to a new file. I'm looking for a "remove lines in place" method that, unlike sed -i, does not make copies of any kind, but rather just removes lines from the existing file.
Here's what I've done...
awk 'NR > 37' file.xml > 'f2.xml'
sed -i '1,37d' file.xml
Both of these seem to do a full copy. Is there any other simple CLI that can do this quickly without a full document traversal?
Upvotes: 12
Views: 38996
Reputation: 29
This is more directly an answer to another question on Stack Overflow, but that question was closed and marked as a duplicate of this one... so I'm answering it here.
You can efficiently truncate the start of a file, even if other processes have that file open for writing and/or reading and are actively writing and/or reading it, using:
nBytes=_____   # number of bytes to remove from the start of ${file}
fallocate -p -o 0 -l "${nBytes}" "${file}"
As I understand it, this keeps the logical size of the file (the size the filesystem advertises for the file) unchanged, but any full blocks in the specified byte range are deallocated, so reads from that range no longer point to actual blocks on the underlying device and simply return zeros.
Freeing the underlying blocks means they can be used for other things, which (I'd assume) is typically why you would want to truncate data from a file to begin with. At the same time, the apparent file size doesn't change, so other processes that have the file open for I/O are blissfully unaware that anything has changed (so long as you take care to ensure that the part being truncated isn't being used by anything else, which you probably should ensure).
Ultimately, this allows for setups where, for example, one process keeps appending to a file while another periodically frees the space taken up by data at the start that has already been consumed.
NOTE: though not required, this is more efficient when the number of bytes being removed is a multiple of the filesystem block size, which is likely either 512 or 4096 bytes. Any partial block will have zeros written over the data to be removed (instead of having the entire block deallocated), meaning you don't actually get that space back.
REQUIREMENTS: For this to work, you need:
- to have fallocate available (it isn't a bash builtin, though neither are sed/ed/awk/dd/tail/perl/...)
- to be able to get a count in bytes (not lines) of how much data you want removed from the start of the file
- the file to be on one of a handful of filesystems, with a recent enough kernel; the fallocate(2) man page lists which filesystem/kernel combinations support hole punching.
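As a hedged sketch of how this might be applied to the file from the question (file.xml and a 4096-byte block size are my assumptions, and stat here is GNU stat):
nBytes=$(head -n 37 file.xml | wc -c)     # bytes occupied by the first 37 lines
nBytes=$(( nBytes / 4096 * 4096 ))        # round down to whole blocks so they actually get freed
fallocate -p -o 0 -l "${nBytes}" file.xml
stat -c 'size=%s blocks=%b' file.xml      # apparent size unchanged, allocated blocks reduced
Rounding down means the tail end of line 37 is left untouched; the punched range simply reads back as zeros while the file's apparent size stays the same.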
Upvotes: 0
Reputation: 203129
There's no simple way to do in-place editing using UNIX utilities, but here's one in-place file modification solution that you might be able to adapt to your needs (courtesy of Robert Bonomi at https://groups.google.com/forum/#!topic/comp.unix.shell/5PRRZIP0v64):
bytes=$(head -37 "$file" |wc -c)
dd if="$file" bs="$bytes" skip=1 conv=notrunc of="$file"
The final file should be $bytes bytes smaller than the original (since the goal was to remove $bytes bytes from the beginning), so to finish we must remove the final $bytes bytes. We're using conv=notrunc above to make sure that the file doesn't get completely emptied rather than just truncated (see below for an example). On a GNU system such as Linux, the truncation afterwards can be accomplished by:
truncate -s "-$bytes" "$file"
For example, to delete the first 5 lines from this 12-line file:
$ wc -l file
12 file
$ cat file
When chapman billies leave the street,
And drouthy neibors, neibors, meet;
As market days are wearing late,
And folk begin to tak the gate,
While we sit bousing at the nappy,
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
First use dd to remove the target 5 lines (really "$bytes" bytes) from the start of the file and copy the rest from the end to the front, but leave the trailing "$bytes" bytes as-is:
$ bytes=$(head -5 file |wc -c)
$ dd if=file bs="$bytes" skip=1 conv=notrunc of=file
1+1 records in
1+1 records out
253 bytes copied, 0.0038458 s, 65.8 kB/s
$ wc -l file
12 file
$ cat file
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
s, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
and then use truncate to remove those leftover bytes from the end:
$ truncate -s "-$bytes" "file"
$ wc -l file
7 file
$ cat file
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
If we had tried the above dd without conv=notrunc:
$ wc -l file
12 file
$ bytes=$(head -5 file |wc -c)
$ dd if=file bs="$bytes" skip=1 of=file
dd: file: cannot skip to specified offset
0+0 records in
0+0 records out
0 bytes copied, 0.0042254 s, 0.0 kB/s
$ wc -l file
0 file
See the google groups thread I referenced for other suggestions and info.
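If you want the two steps as a single reusable command, here's a minimal sketch wrapping them in a shell function; the name remove_first_lines is my own, not from the thread:
remove_first_lines() {
    local n=$1 file=$2 bytes
    bytes=$(head -n "$n" "$file" | wc -c)
    # shift everything after the first $bytes bytes to the front, in place
    dd if="$file" bs="$bytes" skip=1 conv=notrunc of="$file"
    # then drop the now-redundant $bytes bytes left at the end
    truncate -s "-$bytes" "$file"
}
remove_first_lines 37 file.xml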
Upvotes: 14
Reputation: 3325
The copy will have to be created at some point - so why not create it at the time the "modified" file is read, streaming the altered copy instead of storing it?
What I'm thinking: create a named pipe "file2" that is the output of that same awk 'NR > 37' file.xml or whatever; then whoever reads file2 will not see the first 37 lines.
The drawback is that awk runs each time the file is read, so this is feasible only if the file is read rarely.
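A minimal sketch of that idea (some_reader is just a placeholder for whatever consumes the file):
mkfifo file2
awk 'NR > 37' file.xml > file2 &   # blocks until something opens the pipe for reading
some_reader < file2                # sees file.xml minus its first 37 lines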
Upvotes: 2
Reputation: 123400
Unix file semantics do not allow truncating the front part of a file.
All solutions will be based on either:
- reading the whole file into memory and writing it back out (ed, ex, other editors). This should be fine if your file is <1GB or if you have plenty of RAM.
- writing a modified copy and optionally replacing the original (sed -i, awk/tail > foo). This is fine as long as you have enough free disk space for a copy, and don't mind the wait.
If the file is too large for either of these to work for you, you may be able to work around it depending on what's reading your file.
Perhaps your reader skips comments or blank lines? If so, you can craft a message the reader ignores, make sure it has the same number of bytes as the first 37 lines in your file, and overwrite the start of the file with dd if=yourdata of=file conv=notrunc.
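Here's a hedged sketch of that idea, assuming the reader treats a leading XML comment as ignorable (the filler filename and the comment trick are my own choices, not from the answer):
bytes=$(head -n 37 file.xml | wc -c)
# build an ignorable filler of exactly $bytes bytes: "<!--" + spaces + "-->" + newline
{ printf '<!--'; head -c "$((bytes - 8))" /dev/zero | tr '\0' ' '; printf '%s\n' '-->'; } > filler
dd if=filler of=file.xml conv=notrunc    # overwrite the first $bytes bytes without shrinking the file
The file's size and layout on disk are unchanged; the "removed" lines are simply replaced by content the reader skips.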
Upvotes: 6