Reputation: 3
So I have a website and I need to access a single line (line number is known) in a giant text file (~2GB).
I came to the conclusion that
system_exec("sed -n 3p << /file/whatever.txt");
in PHP is the most efficient way.
But I don't feel very comfortable using it; it seems like a bad hack, and insecure. Is it really OK to use? Is this somehow possible in plain PHP? Or are there more efficient ways to do this?
Upvotes: 0
Views: 267
Reputation: 10039
On my system I reach a totally different conclusion. Environment: AIX under ksh.
FileName=listOfBig.txt
# ls -l reports 239,070,208 bytes
wc -l ${FileName} | read LineCount Ignore   # ksh runs 'read' in the current shell
# LineCount is 638976
# take a block of 8 lines starting 1024 lines before the end
LineToStart=$(( ${LineCount} - 1024 ))
LineToTake=8
LineToStop=$(( ${LineToStart} + ${LineToTake} - 1 ))
time sed -n "${LineToStart},${LineToStop} p;${LineToStop} q" ${FileName} >/dev/null
real 0m1.49s
user 0m0.45s
sys 0m0.41s
time sed "${LineToStart},${LineToStop} !d;${LineToStop} q" ${FileName} >/dev/null
real 0m1.51s
user 0m0.45s
sys 0m0.42s
time tail -n +${LineToStart} ${FileName} | head -${LineToTake} >/dev/null
real 0m0.34s
user 0m0.00s
sys 0m0.00s
time head -${LineToStop} ${FileName} | tail -${LineToTake} >/dev/null
real 0m0.84s
user 0m0.75s
sys 0m0.23s
The second and following tests certainly have a small advantage over the first (file cache, etc.), but the difference is not large.
So, in this test, sed is a lot slower (these are not the GNU versions of the tools, as found on Linux).
There is another issue, not shown here, that comes up with huge files (it can happen with small ones too, but rarely): a piped stream goes wrong if the file is changing underneath you (often the case with logs). I ran into this once and had to create a temporary file (unfortunately also huge) to serve any further requests for lines.
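Since the original question is about PHP, here is a minimal sketch of that workaround: snapshot the file first, then extract the lines from the stable copy. The path and line numbers are placeholders taken from the test above, and tail|head is the fastest variant measured there.
<?php
// Sketch: snapshot a log that may still be written to, so that
// line extraction works on stable content.
$source   = '/file/whatever.txt';                 // placeholder path
$snapshot = tempnam(sys_get_temp_dir(), 'snap');  // will be as big as the source

if (!copy($source, $snapshot)) {
    throw new RuntimeException("could not snapshot $source");
}

$lineToStart = 637952;   // 638976 - 1024, as in the test above
$lineToTake  = 8;
$cmd = sprintf('tail -n +%d %s | head -n %d',
    $lineToStart, escapeshellarg($snapshot), $lineToTake);
echo shell_exec($cmd);

unlink($snapshot);       // clean up the (huge) temporary copy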
Upvotes: 0
Reputation:
Here are various ways you can offset into a file, along with some crude benchmarks.
I created a text file with 90M lines. Each line contained 'something#####', though the numbers don't match up with the actual row (to make creating the sample data faster).
$ wc bigfile.txt
90000000 90000000 1340001000 bigfile.txt
$ ls -lrth bigfile.txt
-rw-rw-r-- 1 admin wheel 1.2G Mar 8 09:37 bigfile.txt
These benchmarks were performed on a 1.3GHz i5, 4GB RAM, MacBook Air (11-inch, Mid 2013) running OS 10.10.2.
First up is awk. I really expected better.
$ time awk 'NR == 10000000{print;exit}' bigfile.txt
something99999
real 0m12.716s
user 0m12.529s
sys 0m0.117s
tail performed a little better, though still quite slow.
$ time tail -n +10000000 bigfile.txt | head -n 1
something99999
real 0m10.393s
user 0m10.311s
sys 0m0.066s
As you found out, sed way outperforms the other contenders so far, for some reason. It is still unacceptably slow, though.
$ time sed -n '10000000{p;q;}' bigfile.txt
something99999
real 0m3.846s
user 0m3.772s
sys 0m0.053s
If you have regular data (the same number of bytes per line, or a deterministic way to compute how many bytes precede a given line), you can forgo reading the file altogether and seek directly to the offset. This is the fastest option, but also the most restrictive in terms of data format. It is what William Pursell was getting at when he suggested padding your data to a fixed size.
$ time tail -c +10000000 bigfile.txt | head -n 1
thing71851
real 0m0.020s
user 0m0.011s
sys 0m0.006s
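(Note the truncated output: bigfile.txt itself is not fixed-width, so byte 10000000 lands in the middle of a line.) If your data really were fixed-width, the same constant-time seek works from PHP with no shelling out at all. A sketch, where the 15-byte record length is my assumption, not a property of the file above:
<?php
// Sketch: jump straight to line N of a file with fixed-width records.
// Assumes every line is exactly $recordLength bytes, newline included.
$recordLength = 15;         // assumed, e.g. 'something99999' plus a newline
$lineNumber   = 10000000;   // 1-based line to fetch

$fh = fopen('bigfile.txt', 'rb');
fseek($fh, ($lineNumber - 1) * $recordLength);   // nothing is read up to here
$line = fgets($fh);
fclose($fh);

echo $line;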
However, if you have a 2G text file, you should consider using a proper database.
$ time sqlite3 bigfile.db << EOF
> create table bigdb(data text);
> .import bigfile.txt bigdb
> EOF
real 3m16.650s
user 3m3.703s
sys 0m4.221s
$ ls -lrth bigfile.db
-rw-r--r-- 1 admin wheel 1.9G Mar 8 10:16 bigfile.db
Now that you have a database, you should be able to get blazing fast speeds, right? Only if you use it properly. OFFSET (the first argument to LIMIT) is notorious for being ridiculously slow, and should be avoided.
$ time sqlite3 bigfile.db <<< 'select * from bigdb limit 10000000-1, 1;'
something99999
real 0m2.156s
user 0m0.688s
sys 0m0.440s
You should have a proper primary key, or use sqlite's handy internal column ROWID, to get optimal performance.
$ time sqlite3 bigfile.db <<< 'select * from bigdb where ROWID == 10000000;'
something99999
real 0m0.017s
user 0m0.003s
sys 0m0.005s
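Since the question is about PHP, the same ROWID lookup can be issued through PDO. A sketch, assuming the pdo_sqlite extension and the bigfile.db built above:
<?php
// Sketch: fetch line N of the imported file by ROWID.
$db = new PDO('sqlite:bigfile.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $db->prepare('SELECT data FROM bigdb WHERE ROWID = :n');
$stmt->execute([':n' => 10000000]);

echo $stmt->fetchColumn(), PHP_EOL;   // something99999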
Upvotes: 1
Reputation: 1726
With a little modification, the fastest way to print a single line of a giant file is to also use the q (quit) command:
sed -n '3{p;q}' yourFile
This will print the 3rd line, and sed will stop reading the file there.
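If you call this from PHP, as in the question, here is a minimal sketch of a safer invocation; the function name fetchLine is mine, escapeshellarg() keeps the path out of the shell's hands, and the int type keeps the line number numeric:
<?php
// Sketch: fetch one known line of a big file from PHP via sed.
function fetchLine(string $path, int $lineNumber): string
{
    $cmd = sprintf("sed -n '%d{p;q}' %s", $lineNumber, escapeshellarg($path));
    return (string) shell_exec($cmd);
}

echo fetchLine('/file/whatever.txt', 3);
If you would rather not shell out at all, SplFileObject::seek($lineNumber - 1) followed by current() does the same in plain PHP, though it still reads the file sequentially up to that line.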
Upvotes: 2