emergence92

Reputation: 3

Fastest way to read a single line in a giant file

So I have a website and I need to access a single line (line number is known) in a giant text file (~2GB).

I came to the conclusion that

system_exec("sed -n 3p << /file/whatever.txt");

in PHP is the most efficient way.

But I don't feel very comfortable using it; it seems like a bad hack, and insecure. Is it really okay to use it? Is this somehow possible in plain PHP, without a framework? Or are there more efficient ways to do this?
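For what it's worth, reading a known line without shelling out is possible in plain PHP with SplFileObject, which streams through the file rather than loading it into memory. A minimal sketch, with the path and line number as placeholders:

<?php
// Minimal sketch: stream to a known line with SplFileObject (no framework, no shell).
// Path and line number below are placeholders.
$path       = '/file/whatever.txt';
$lineNumber = 3;                    // 1-based line we want

$file = new SplFileObject($path, 'r');
$file->seek($lineNumber - 1);       // seek() counts lines from 0
echo $file->current();              // the requested line

This still has to scan up to the target line, so on a ~2GB file the cost is comparable to the sed call, just without the shell.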

Upvotes: 0

Views: 267

Answers (3)

NeronLeVelu

Reputation: 10039

On my system I come to a totally different conclusion. Environment: AIX under ksh.

FileName=listOfBig.txt
# ls -l -> 239.070.208 bytes
# wc -l listOfBig.txt | read FileSize Ignore
FileSize=638976

# take 8 lines starting 1024 lines before the end of the file
LineToStart=$(( ${FileSize} - 1024 ))
LineToTake=8
LineToStop=$(( ${LineToStart} + ${LineToTake} - 1 ))

time sed -n "${LineToStart},${LineToStop} p;${LineToStop} q" ${FileName} >/dev/null
real    0m1.49s
user    0m0.45s
sys     0m0.41s

time sed "${LineToStart},${LineToStop} !d;${LineToStop} q" ${FileName} >/dev/null
real    0m1.51s
user    0m0.45s
sys     0m0.42s

time tail -n +${LineToStart} ${FileName} | head -${LineToTake} >/dev/null
real    0m0.34s
user    0m0.00s
sys     0m0.00s

time head -${LineToStop}  ${FileName} | tail -${LineToTake} >/dev/null
real    0m0.84s
user    0m0.75s
sys     0m0.23s

There is certainly a small advantage for the second and following tests over the first (cache, ...), but it does not change the picture much.

So, in this test, sed is a lot slower (these are not the GNU versions of the tools, as on Linux).

There is another issue, not shown here, that arises with huge files (it can happen with small ones too, but rarely): the piped stream becomes a problem if the file is changing while it is being read (often the case with logs). I hit this once and had to create a temporary file (unfortunately also huge) to serve any further requests for lines.
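In PHP terms (the language of the question), that workaround is roughly: snapshot the changing file to a temporary copy first, then serve every line request from the stable copy. A rough sketch, with the source path as a placeholder:

<?php
// Sketch of the snapshot workaround for a file that changes while being read (e.g. a live log).
// Source path is a placeholder.
$source   = '/file/whatever.log';
$snapshot = tempnam(sys_get_temp_dir(), 'snap_');

if (!copy($source, $snapshot)) {
    throw new RuntimeException("could not snapshot $source");
}

// Serve any number of line lookups from the copy, which no longer changes underneath us.
$file = new SplFileObject($snapshot, 'r');
$file->seek(2);              // 0-based: line 3
echo $file->current();

unlink($snapshot);           // clean up the (possibly huge) temporary file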

Upvotes: 0

user1902824

Reputation:

Here are various ways you can offset into a file, along with some crude benchmarks.

I created a text file with 90M lines. Each line contained 'something#####', though the numbers don't match up with the actual row (to make creating the sample data faster).

$ wc bigfile.txt
90000000 90000000 1340001000 bigfile.txt

$ ls -lrth bigfile.txt
-rw-rw-r--  1 admin  wheel   1.2G Mar  8 09:37 bigfile.txt

These benchmarks were performed on a 1.3GHz i5, 4GB RAM, MacBook Air (11-inch, Mid 2013) running OS 10.10.2.

First up is awk. I really expected better.

$ time awk 'NR == 10000000{print;exit}' bigfile.txt
something99999

real    0m12.716s
user    0m12.529s
sys     0m0.117s

tail performed a little better, though still quite slow.

$ time tail -n +10000000 bigfile.txt | head -n 1
something99999

real    0m10.393s
user    0m10.311s
sys     0m0.066s

As you found out, sed far outperforms the other contenders so far, for some reason, though it is still unacceptably slow.

$ time sed -n '10000000{p;q;}' bigfile.txt
something99999

real    0m3.846s
user    0m3.772s
sys     0m0.053s

If you have regular data (the same number of bytes per line, or a way to deterministically compute each line's byte offset), you can forgo scanning the file line by line and seek directly to the offset. This is the fastest option, but also the most restrictive in terms of data format. This is what William Pursell was getting at when he suggested padding your data to a fixed size. (In PHP this is a plain fseek; see the sketch after the timing below.)

$ time tail -c +10000000 bigfile.txt | head -n 1
thing71851

real    0m0.020s
user    0m0.011s
sys     0m0.006s
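On the PHP side of the original question, the same idea maps directly onto fseek: if every line occupies a known, fixed number of bytes, the byte offset of line N is plain arithmetic. A sketch under that assumption (path, record length and line number are placeholders):

<?php
// Sketch: jump straight to line N of a file whose lines are all exactly $recordLength bytes
// (newline included). Path, record length and line number are placeholders.
$path         = '/file/whatever.txt';
$recordLength = 15;
$lineNumber   = 10000000;        // 1-based

$fp = fopen($path, 'rb');
fseek($fp, ($lineNumber - 1) * $recordLength);   // no scanning, just a seek
echo fgets($fp);                                 // read exactly that line
fclose($fp);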

However, if you have a 2G text file, you should consider using a proper database.

$ time sqlite3 bigfile.db << EOF
> create table bigdb(data text);
> .import bigfile.txt bigdb
> EOF

real    3m16.650s
user    3m3.703s
sys     0m4.221s

$ ls -lrth bigfile.db
-rw-r--r--  1 admin  wheel   1.9G Mar  8 10:16 bigfile.db

Now that you have a database, you should be able to get blazing fast speeds, right? Only if you use it properly. OFFSET (the first argument in this comma form of LIMIT) is notorious for being ridiculously slow and should be avoided.

$ time sqlite3 bigfile.db <<< 'select * from bigdb limit 10000000-1, 1;'
something99999

real    0m2.156s
user    0m0.688s
sys     0m0.440s

You should have a proper primary key, or use sqlite's handy internal column ROWID to get optimal performance.

$ time sqlite3 bigfile.db <<< 'select * from bigdb where ROWID == 10000000;'
something99999

real    0m0.017s
user    0m0.003s
sys     0m0.005s
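Since the question comes from PHP, the same ROWID lookup can be done through PDO's SQLite driver; a minimal sketch reusing the bigfile.db / bigdb names from the import above (the line number is a placeholder):

<?php
// Sketch: fetch line N from the SQLite table built above, addressed by ROWID.
$pdo = new PDO('sqlite:bigfile.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('SELECT data FROM bigdb WHERE ROWID = :n');
$stmt->execute([':n' => 10000000]);   // placeholder line number
echo $stmt->fetchColumn(), PHP_EOL;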

Upvotes: 1

josifoski

Reputation: 1726

With a little modification, the fastest way to print a single line of a giant file is to also use the q (quit) command:

sed -n '3{p;q}' yourFile

This will print the 3rd line, and sed will stop reading the file at that point.
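If the shell route from the question is kept, the call from PHP would roughly look like this, with the path escaped and the line number cast to an integer (both values are placeholders):

<?php
// Sketch: call sed's print-and-quit form from PHP with minimally sanitized inputs.
$path = '/file/whatever.txt';
$line = 3;

$cmd = sprintf("sed -n '%d{p;q}' %s", (int) $line, escapeshellarg($path));
echo shell_exec($cmd);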

Upvotes: 2
