Mittenchops

Reputation: 19654

Remove first N lines of a file in place in unix command line

I'm trying to remove the first 37 lines from a very, very large file. I started trying sed and awk, but they seem to require copying the data to a new file. I'm looking for a "remove lines in place" method that, unlike sed -i, doesn't make a copy of any kind but simply removes lines from the existing file.

Here's what I've done...

awk 'NR > 37' file.xml > 'f2.xml'
sed -i '1,37d' file.xml

Both of these seem to do a full copy. Is there any other simple CLI that can do this quickly without a full document traversal?

Upvotes: 12

Views: 38996

Answers (5)

jkool702

Reputation: 29

This is more directly an answer to another question on Stack Overflow, but that question was closed as a duplicate of this one, so I'm answering it here.

You can efficiently truncate the start of a file, even if other processes have that file open for writing and/or reading and are actively writing and/or reading it, using:

nBytes=_____   # number of bytes to remove from the start of "$file"
fallocate -p -o 0 -l "$nBytes" "$file"

As I understand it, this keeps the logical size of the file (the size the filesystem reports) unchanged, but any full blocks in the specified byte range are deallocated, so reads from that range no longer touch an actual block on the underlying device and simply return zeros.

Freeing the underlying blocks means they can be used for other things, which (I'd assume) is typically why you would want to truncate data from a file in the first place. At the same time, the apparent file size doesn't change, so other processes that have the file open for I/O are blissfully unaware that anything has changed (provided you take care that nothing else is still using the part being truncated, which you should ensure).

Ultimately, this allows for setups where

  1. process A is constantly appending data to the end of the file
  2. process B is reading data in chunks as it is being appended (and presumably doing something useful with it)
  3. process C is removing already-read data from the start of the file to reduce disk usage (or memory usage if it is on a tmpfs).

NOTE: though not required, this is most efficient when the number of bytes being removed is a multiple of the filesystem block size, which is likely either 512 or 4096 bytes. Any partial block at the edge of the range is zero-filled instead of being deallocated, meaning you don't actually get that space back.


REQUIREMENTS: For this to work, you need:

  1. to have fallocate available (it isn't a bash builtin, though neither are sed/ed/awk/dd/tail/perl/...)

  2. to be able to get a count in bytes (not lines) of how much data you want removed from the start of the file

  3. the file to be on one of a handful of filesystems, under a recent enough kernel. Filesystem/kernel combinations that work are:

  • XFS (Linux 2.6.38+)
  • ext4 (Linux 3.0+)
  • tmpfs (Linux 3.5+)
  • Btrfs (Linux 3.7+)
  • gfs2 (Linux 4.16+)
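Applied to the question here, a minimal sketch (assuming GNU coreutils; `file.xml` and the 37-line count come from the question; remember the hole only frees whole blocks, and readers must expect zeros at those offsets):

```shell
# Sketch: deallocate the region occupied by the first 37 lines of file.xml.
# The file's apparent size is unchanged; the freed range reads back as zeros.
file=file.xml
nBytes=$(head -n 37 "$file" | wc -c)    # byte length of the first 37 lines
fallocate -p -o 0 -l "$nBytes" "$file"  # punch a hole over that range
```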

Upvotes: 0

Ed Morton

Reputation: 203129

There's no simple way to do in-place editing with standard UNIX utilities, but here's one in-place file-modification solution that you might be able to adapt to your needs (courtesy of Robert Bonomi at https://groups.google.com/forum/#!topic/comp.unix.shell/5PRRZIP0v64):

bytes=$(head -37 "$file" |wc -c)
dd if="$file" bs="$bytes" skip=1 conv=notrunc of="$file"

The final file should be $bytes bytes smaller than the original (since the goal was to remove $bytes bytes from the beginning), so to finish we must remove the final $bytes bytes. We're using conv=notrunc above to make sure that the file doesn't get completely emptied rather than just truncated (see below for example). On a GNU system such as Linux doing the truncation afterwards can be accomplished by:

truncate -s "-$bytes" "$file"

For example to delete the first 5 lines from this 12-line file

$ wc -l file
12 file

$ cat file
When chapman billies leave the street,
And drouthy neibors, neibors, meet;
As market days are wearing late,
And folk begin to tak the gate,
While we sit bousing at the nappy,
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.

First use dd to remove the target 5 lines (really "$bytes" bytes) from the start of the file and copy the rest from the end to the front but leave the trailing "$bytes" bytes as-is:

$ bytes=$(head -5 file |wc -c)

$ dd if=file bs="$bytes" skip=1 conv=notrunc of=file
1+1 records in
1+1 records out
253 bytes copied, 0.0038458 s, 65.8 kB/s

$ wc -l file
12 file

$ cat file
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
s, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.

and then use truncate to remove those leftover bytes from the end:

$ truncate -s "-$bytes" "file"

$ wc -l file
7 file

$ cat file
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.

If we had tried the above without dd ... conv=notrunc:

$ wc -l file
12 file
$ bytes=$(head -5 file |wc -c)
$ dd if=file bs="$bytes" skip=1 of=file
dd: file: cannot skip to specified offset
0+0 records in
0+0 records out
0 bytes copied, 0.0042254 s, 0.0 kB/s
$ wc -l file
0 file

See the google groups thread I referenced for other suggestions and info.
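The two steps can be wrapped into a small helper. A sketch (the function name `remove_first_lines` is mine; it assumes GNU `truncate` and that no other process has the file open, since the shift-then-truncate sequence is not atomic):

```shell
# Hypothetical helper: shift the tail of the file forward over the first
# n lines with dd, then chop the now-duplicated trailing bytes with truncate.
# Needs no extra disk space, but leaves the file inconsistent if interrupted.
remove_first_lines() {
  file=$1 n=$2
  bytes=$(head -n "$n" "$file" | wc -c)        # bytes in the first n lines
  dd if="$file" bs="$bytes" skip=1 conv=notrunc of="$file" 2>/dev/null
  truncate -s "-$bytes" "$file"                # drop the leftover tail bytes
}
# usage: remove_first_lines file.xml 37
```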

Upvotes: 14

Peteris

Reputation: 3325

A copy has to be created at some point, so why not create it at the time the "modified" file is read, streaming the altered copy instead of storing it?

What I'm thinking - create a named pipe "file2" that is the output of that same awk 'NR > 37' file.xml or whatever; then whoever reads file2 will not see the first 37 lines.

The drawback is that it will run awk each time the file is processed, so it's feasible only if it's read rarely.
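A minimal sketch of that idea (the name `file2` follows the answer; `mkfifo` is POSIX; note the awk writer serves exactly one reader and must be restarted for the next one):

```shell
# Sketch: present file.xml minus its first 37 lines as the named pipe file2.
mkfifo file2                       # create the FIFO once
awk 'NR > 37' file.xml > file2 &   # writer blocks until a reader opens file2
# a single reader now sees lines 38 onward, e.g.:  wc -l < file2
```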

Upvotes: 2

that other guy

Reputation: 123400

Unix file semantics do not allow truncating the front part of a file.

All solutions will be based on either:

  1. Reading the file into memory and then writing it back (ed, ex, other editors). This should be fine if your file is <1GB or if you have plenty of RAM.
  2. Writing a second copy and optionally replacing the original (sed -i, awk/tail > foo). This is fine as long as you have enough free diskspace for a copy, and don't mind the wait.

If the file is too large for any of these to work for you, you may be able to work around it depending on what's reading your file.

Perhaps your reader skips comments or blank lines? If so, you can craft a message the reader ignores, make sure it has the same number of bytes as the first 37 lines of your file, and overwrite the start of the file with dd if=yourdata of=file conv=notrunc.
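A sketch of that trick, assuming the reader skips lines starting with `#` (the padding scheme is mine: one `#`, spaces, and a final newline, sized to exactly match the region being blanked):

```shell
# Hypothetical: overwrite the first 37 lines with one same-sized comment line.
file=file.xml
bytes=$(head -n 37 "$file" | wc -c)                # size of the region to blank
{ printf '#'                                       # comment marker
  head -c "$((bytes - 2))" /dev/zero | tr '\0' ' ' # pad with spaces
  printf '\n'; } |                                 # keep it one valid line
  dd of="$file" conv=notrunc 2>/dev/null           # overwrite in place
```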

Upvotes: 6

gniourf_gniourf

Reputation: 46813

ed is the standard editor:

ed -s file <<< $'1,37d\nwq'
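The `<<<` here-string is a bashism; a POSIX-shell equivalent pipes the same script in on stdin (using separate `w` and `q`, since `wq` as a single command may not be portable beyond GNU ed):

```shell
# Same edit without bash: feed ed its commands through a pipe.
printf '%s\n' '1,37d' w q | ed -s file.xml
```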

Upvotes: 7
