Reputation: 12391
I have a text file of around 10 GB and I need to do some processing on the textual data in it. What is the best way to read, access, and process such a huge file?
I am thinking of breaking the file into chunks, processing the smaller files (or buffers in memory, which would be even better), and then merging the results. Something like the map-reduce paradigm, but without using big data technologies.
Upvotes: 2
Views: 2025
Reputation: 25191
Options for how to process the data:
load everything into RAM and process it at once - if it fits there...
process only one item at a time, for example line by line. Great if the processing requires no information beyond the currently processed line itself - no shared storage, no database...
combine the two above: read a bunch of items (text lines etc.), process them, read another bunch of items... If you want/need to use shared storage (a database) while processing, then processing in batches is more effective than one-by-one (see the sketch after this list).
"Hadoop style": use well scalable algorithms and data structures, like map, sort, maybe windowing, event streams, binary search - and connect them together into data processing pipeline. No shared storage. Basically this is the "line by line" approach, but with some magic that will give you "the right lines" (sorted, aggregated, grouped by some key, top N, last N...).
A few tips from my experience:
Use compression. Even if your disks are large enough, disk (or network!) I/O is usually the bottleneck.
Use batches/chunks wherever possible to process/send/save/load... more items at once. If dealing with database: process (select, insert, update...) more items at once. For example MongoDB has bulk operations. This saves network I/O overhead.
Try to minimize the syscall count (by doing stuff in batches as mentioned above). Each syscall means the CPU has to switch context, the CPU cache content is gone, and the operating system may have to communicate with hardware...
Use all CPU cores. Some platforms (Python, Ruby) are better here with processes than threads.
Use the CPU cache if possible. For example, "linear" data structures like arrays or C++ vector are better in this regard than, say, linked lists. Use a sorted array and binary search instead of a dict/map and key lookup - smaller memory footprint, less memory fragmentation, bigger chance of CPU cache hits (see the bisect sketch after this list).
Split the input data into parts so that even loading the data can be easily parallelized.
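As a small illustration of the "sorted array and binary search" tip, here is a sketch using Python's bisect module; the data is made up:

import bisect

# sorted, compact parallel arrays instead of a dict
keys = ['apple', 'banana', 'cherry', 'plum']   # must stay sorted
counts = [10, 25, 7, 3]

def lookup(key):
    # binary search: O(log n) in contiguous, cache-friendly memory
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return counts[i]
    return None

print(lookup('cherry'))  # 7
print(lookup('kiwi'))    # None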
Now, how to do it:
You can use Hadoop or similar tools in "localhost mode" - no need to deploy full stack with YARN, Zookeeper and whatnot. Just apt install hadoop (or something like that), write some .java file with your data processing logic, compile to .jar, execute within Hadoop, done. No need to work with HDFS (if you don't want to), just normal files.
Or write something from scratch. Here I recommend Python, because it works well with all imaginable stuff (file formats, databases, math libs) and its multiprocessing module provides great tools (like processes, process pools, queues, locks, parallel map, a Redis-like data server) to make your program somewhat distributed. If you find Python slow, just rewrite the slow part in C/C++ and use it from Python (using cffi or Cython).
Most of Python's multiprocessing functionality is limited to a single host/computer. I think that's mostly OK, because today's hardware usually has many CPU cores. And if not, just launch an AWS EC2 instance with as many cores as you like for a few cents per hour.
Let's do an example - word count, the "hello world" of big data. Using Python. I will use the cswiki.xml.bz2 Wikipedia dump, which is 618 MB compressed and 2.35 GB uncompressed. It's an XML file, but we will work with it as a text file to keep things simple :)
First - working with one single file is tedious. It's much better to split it into smaller files so the input data can be distributed to multiple workers more easily:
$ bzcat cswiki-20160920-pages-articles-multistream.xml.bz2 | \
    split \
        --filter='xz -1 > $FILE' \
        --additional-suffix=.xz \
        --lines=5000000 \
        - cswiki-splitted.
Result:
$ ls -1hs cswiki*
618M cswiki-20160920-pages-articles-multistream.xml.bz2
94M cswiki-splitted.aa.xz
77M cswiki-splitted.ab.xz
74M cswiki-splitted.ac.xz
64M cswiki-splitted.ad.xz
62M cswiki-splitted.ae.xz
56M cswiki-splitted.af.xz
54M cswiki-splitted.ag.xz
58M cswiki-splitted.ah.xz
59M cswiki-splitted.ai.xz
15M cswiki-splitted.aj.xz
Here is a straightforward wordcount implementation that uses multiprocessing.Pool:
#!/usr/bin/env python3

import lzma
import multiprocessing
from os import getpid
from pathlib import Path
import re


def main():
    input_dir = Path('.')
    input_files = [p for p in input_dir.iterdir() if p.name.startswith('cswiki-splitted.')]

    # count words in all files in parallel, one worker process per CPU core
    pool = multiprocessing.Pool()
    partial_results = pool.map(process_file, input_files)

    # merge the per-file counts into one dict
    aggregated_results = {}
    for pr in partial_results:
        for word, count in pr.items():
            aggregated_results[word] = aggregated_results.get(word, 0) + count

    # sort by count, descending
    words_and_counts = aggregated_results.items()
    counts_and_words = [(c, w) for w, c in words_and_counts]
    counts_and_words.sort(reverse=True)
    print('Top 100:', counts_and_words[:100])


def process_file(path):
    print('Process {} reading file {}'.format(getpid(), path))
    counts = {}
    with lzma.open(str(path), 'rt') as f:
        for line in f:
            words = re.split(r'\W+', line)
            for word in words:
                if word != '':
                    word = word.lower()
                    counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == '__main__':
    main()
Output:
$ ./wordcount.py
Process 2480 reading file cswiki-splitted.ab.xz
Process 2481 reading file cswiki-splitted.ah.xz
Process 2482 reading file cswiki-splitted.aj.xz
Process 2483 reading file cswiki-splitted.aa.xz
Process 2484 reading file cswiki-splitted.af.xz
Process 2485 reading file cswiki-splitted.ac.xz
Process 2486 reading file cswiki-splitted.ai.xz
Process 2487 reading file cswiki-splitted.ae.xz
Process 2482 reading file cswiki-splitted.ad.xz
Process 2481 reading file cswiki-splitted.ag.xz
Top 100: [(4890109, 'quot'), (4774018, 'gt'), (4765677, 'lt'), (4468312, 'id'), (4433742, 'v'), (4377363, 'a'), (2767007, 'na'), (2459957, 'text'), (2278791, 'amp'), (2114275, 'se'), (1971423, 'ref'), (1968093, 'kategorie'), (1799812, 'align'), (1795733, 'nbsp'), (1779981, 'title'), (1662895, '0'), (1592622, '1'), (1489233, 'page'), (1485505, 'je'), (1483416, 'model'), (1476711, 'format'), (1473507, '2'), (1470963, 'ns'), (1468018, 'revision'), (1467530, 'contributor'), (1467479, 'timestamp'), (1467453, 'sha1'), (1429859, 'comment'), (1414549, 'username'), (1261194, 's'), (1177526, '3'), (1159601, 'z'), (1115378, 'http'), (1040230, 'parentid'), (1012821, 'flagicon'), (949947, 'do'), (920863, 'right'), (887196, 'br'), (828466, 'x'), (797722, 've'), (795342, '4'), (783019, 'www'), (778643, '6'), (762929, 'name'), (762220, 'wiki'), (757659, 'i'), (752524, 'space'), (742525, 'xml'), (740244, 'center'), (733809, 'preserve'), (733752, 'wikitext'), (730781, 'o'), (725646, 'cz'), (679842, '5'), (672394, 'datum'), (599607, 'u'), (580936, 'byl'), (563301, 'k'), (550669, 'roce'), (546944, '10'), (536135, 'pro'), (531257, 'jako'), (527321, 'rd1'), (519607, '7'), (515398, 'roku'), (512456, 'od'), (509483, 'style'), (488923, 'za'), (485546, 'titul'), (467147, 'jméno'), (451536, '14'), (448649, '2016'), (447374, 'po'), (444325, 'citace'), (442389, 'jpg'), (424982, '12'), (423842, 'že'), (416419, 'název'), (408796, 'redirect'), (405058, 'minor'), (402733, 'to'), (400355, 'soubor'), (398188, '8'), (395652, 'the'), (393122, '11'), (389370, 'místo'), (368283, '15'), (359019, 'url'), (355302, 'monografie'), (354336, 'odkazy'), (352414, 'jsou'), (348138, 'of'), (344892, 'narození'), (340021, 'vydavatel'), (339462, '2014'), (339219, '20'), (339063, 'jeho'), (336257, '9'), (332598, 'praha'), (328268, 'byla')]
We can see there is a lot of noise from XML tags and attributes. That's what you get from running wordcount on an XML file :)
All the file reading and word counting was done in parallel. Only the final aggregation was performed in the main process.
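As a side note, the same aggregation could be written with collections.Counter, which adds counts when merging - a minimal sketch of that variant, reusing partial_results from the example above:

from collections import Counter

# Counter.update() adds counts instead of replacing them like dict.update()
aggregated_results = Counter()
for pr in partial_results:
    aggregated_results.update(pr)

print('Top 100:', aggregated_results.most_common(100))

The only difference is that most_common returns (word, count) pairs instead of (count, word).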
Upvotes: 1
Reputation: 11
I would load each chunk of the file into a buffer using a thread, process the buffer, and do whatever you need with it. Then, to load more buffers, drop the previous ones from memory and continue. Look into how audio is streamed - it is loaded into buffers the same way you want to do it.
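A minimal sketch of that idea in Python, assuming fixed-size chunks and a hypothetical process_buffer function (the double-buffering/threading part is left out for brevity):

def process_buffer(buffer):
    # hypothetical placeholder for whatever processing you need per chunk
    pass

def read_in_buffers(path, buffer_size=64 * 1024 * 1024):
    # read 64 MB at a time; each buffer goes out of scope (and can be
    # garbage collected) when the next one replaces it
    with open(path, 'rb') as f:
        while True:
            buffer = f.read(buffer_size)
            if not buffer:
                break
            # note: a chunk boundary may split a line or word in half;
            # carry the tail over to the next buffer if that matters
            process_buffer(buffer)

read_in_buffers('huge_file.txt')  # 'huge_file.txt' is a made-up name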
Upvotes: 0
Reputation: 5954
If you load all 10 GB into memory, everything is simple.
If you cannot afford that, then load only a range of the big file into your buffer at a time.
When you are done with a portion, you slide the window (change the range) and load the new range of data into your buffer, so the previous range of data in the buffer is discarded (overwritten).
You may seek to the desired location, and might need to go back and forth to load data. That can be relatively slow, but it's the price you pay for using less memory (a space-time tradeoff).
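A minimal sketch of the sliding-window idea, assuming a fixed window size and a made-up file name:

def read_window(path, offset, window_size=16 * 1024 * 1024):
    # load one window (range) of the file into a buffer;
    # the caller decides where to slide the window next
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(window_size)

# example: slide the window forward through the whole file
offset = 0
while True:
    buffer = read_window('huge_file.txt', offset)  # hypothetical file name
    if not buffer:
        break
    # ... process the buffer here ...
    offset += len(buffer)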
--
You might want to read the source code of programs that can handle huge files, e.g. file archivers.
Upvotes: 0