midas06

Reputation: 2001

What's the best way to read and parse a large text file over the network?

I have a problem which requires me to parse several log files from a remote machine. There are a few complications: 1) the file may be in use; 2) the files can be quite large (100 MB+); 3) each entry may be multi-line.

To solve the in-use issue, I need to copy the file first. I'm currently copying it directly from the remote machine to the local machine and parsing it there. That leads to issue 2: since the files are quite large, copying them locally can take quite a while.

To enhance parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-lined entries a bit trickier.

The two main issues are: 1) How do I speed up the file transfer? (Compression? Is transferring it locally even necessary? Can I read an in-use file some other way?) 2) How do I deal with multi-line entries when splitting up the lines among threads?

UPDATE: The reason I didn't do the obvious thing and parse on the server is that I want to have as little CPU impact as possible. I don't want to affect the performance of the system I'm testing.

Upvotes: 10

Views: 7432

Answers (9)

Daniel Bişar

Reputation: 2773

The given answers don't satisfy me, and maybe mine will help others see that this isn't super complicated and that multithreading can still pay off in such a scenario. It may not make the transfer faster, but depending on the complexity of your parsing it may make the parsing, or the analysis of the parsed data, faster.

It really depends upon the details of your parsing. What kind of information do you need to get from the log files? Is it statistical information, or does it depend on multiple log messages? You have several options:

  • parsing multiple files at the same time would be the easiest, I guess; you have the file as context and can create one thread per file
  • another option, as mentioned before, is to use compression for the network communication
  • you could also use a helper that, as a first step, splits the log file into blocks of lines that belong together and then processes these blocks with multiple threads; parsing these dependent lines should be quite easy and fast (see the sketch below)
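
Here is a rough sketch of that third option. It assumes a hypothetical format where continuation lines of a multi-line entry start with whitespace: the blocks are built single-threaded, then parsed in parallel with PLINQ. ParseEntry is only a placeholder for the real per-entry work.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class BlockParser
{
    // Group lines into blocks: a new block starts on every non-indented line.
    // (Assumption: continuation lines of a multi-line entry start with whitespace.)
    static IEnumerable<List<string>> ReadBlocks(string path)
    {
        var block = new List<string>();
        foreach (var line in File.ReadLines(path))
        {
            bool isContinuation = line.Length > 0 && char.IsWhiteSpace(line[0]);
            if (!isContinuation && block.Count > 0)
            {
                yield return block;
                block = new List<string>();
            }
            block.Add(line);
        }
        if (block.Count > 0)
            yield return block;
    }

    // Placeholder for the real per-entry parsing work.
    static string ParseEntry(List<string> lines) => lines[0];

    static void Main(string[] args)
    {
        // Split single-threaded, parse the blocks in parallel.
        var results = ReadBlocks(args[0]).AsParallel().Select(ParseEntry).ToList();
        Console.WriteLine("Parsed {0} entries.", results.Count);
    }
}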

Very important in such a scenario is to measure where your actual bottleneck is. If your bottleneck is the network, you won't benefit much from optimizing the parser. If your parser creates a lot of objects of the same kind, you could use the object pool pattern and create objects with multiple threads. Try to process the input without allocating too many new strings. Parsers are often written with a lot of string.Split and so forth, which is not as fast as it could be. You could navigate the stream by checking the upcoming values, instead of reading the complete string and splitting it again, and directly fill the objects you will need once parsing is done.
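
To illustrate that last point, here is a hypothetical line parser that slices a ReadOnlySpan<char> (modern .NET) instead of calling string.Split, so no intermediate array of substrings is allocated. The space-delimited "timestamp level message" layout is only an assumption for the example.

using System;

readonly struct LogEntry
{
    public readonly string Timestamp;
    public readonly string Level;
    public readonly string Message;

    public LogEntry(string timestamp, string level, string message)
    {
        Timestamp = timestamp;
        Level = level;
        Message = message;
    }
}

static class SpanParser
{
    // Parses e.g. "2009-01-05T12:00:00 INFO something happened" without string.Split.
    // Assumes a well-formed, space-delimited line; a real parser would validate.
    public static LogEntry Parse(ReadOnlySpan<char> line)
    {
        int firstSpace = line.IndexOf(' ');
        ReadOnlySpan<char> timestamp = line.Slice(0, firstSpace);

        ReadOnlySpan<char> rest = line.Slice(firstSpace + 1);
        int secondSpace = rest.IndexOf(' ');
        ReadOnlySpan<char> level = rest.Slice(0, secondSpace);
        ReadOnlySpan<char> message = rest.Slice(secondSpace + 1);

        // Only the final fields are materialized as strings.
        return new LogEntry(timestamp.ToString(), level.ToString(), message.ToString());
    }
}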

Optimization is almost always possible; the question is how much you get out for how much effort you put in, and how critical your scenario is.

Upvotes: 0

VVS

Reputation: 19612

If you can copy the file, you can read it. So there's no need to copy it in the first place.

EDIT: use the FileStream class to have more control over the access and sharing modes.

new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)

should do the trick.
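
For example, a small sketch (the UNC path is a placeholder) that reads the log straight off the remote share while another process keeps writing to it:

using System.IO;

class LogReader
{
    static void Main()
    {
        // UNC path is a placeholder; point it at the remote share.
        const string path = @"\\remotemachine\logs\app.log";

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Parse each line here; FileShare.ReadWrite lets the writer keep appending.
            }
        }
    }
}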

Upvotes: 1

Andrew Edgecombe

Reputation: 40382

The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.

This is one of the reasons that so many databases use stored procedures that are run at the server end.

Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.

If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer. There are, for example, sftp servers available that will perform compression on the fly. At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.

Upvotes: 2

SquareCog

Reputation: 19666

Use compression for transfer.

If your parsing is really slowing you down, and you have multiple processors, you can break the parsing job up; you just have to do it in a smart way: have a deterministic algorithm for deciding which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of the middle of a record, for example, you could break the file into N/M segments, each responsible for M lines; when one of the jobs determines that its record is not finished, it just has to read on until it reaches the end of the record. When one of the jobs determines that it's reading a record for which it doesn't have a beginning, it should skip the record.
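
A sketch of that idea, under the assumption that a record starts on a non-indented line and indented lines are continuations (adjust IsRecordStart to your real format). Each worker skips a leading partial record and reads past its segment boundary to finish its last record, so every record is parsed exactly once:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class SegmentedParser
{
    // Assumption (hypothetical format): a record starts on a non-indented line,
    // and indented lines are continuations of the previous record.
    static bool IsRecordStart(string line) =>
        line.Length > 0 && !char.IsWhiteSpace(line[0]);

    // Parse one segment of 'count' lines. Skip a leading partial record (it belongs
    // to the previous segment) and read past the segment end to finish the last record.
    static List<List<string>> ParseSegment(string[] lines, int start, int count)
    {
        var records = new List<List<string>>();
        int i = start;

        // Skip continuation lines owned by the previous segment's worker.
        while (i < lines.Length && !IsRecordStart(lines[i]))
            i++;

        int end = start + count;
        while (i < lines.Length && i < end)
        {
            var record = new List<string> { lines[i++] };
            // Keep reading continuations, even past 'end', to complete the record.
            while (i < lines.Length && !IsRecordStart(lines[i]))
                record.Add(lines[i++]);
            records.Add(record);
        }
        return records;
    }

    static void Main(string[] args)
    {
        // For a real 100 MB+ file you would stream or memory-map instead of ReadAllLines.
        string[] lines = File.ReadAllLines(args[0]);
        const int segmentSize = 10000; // M lines per worker

        int segmentCount = (lines.Length + segmentSize - 1) / segmentSize;
        var records = Enumerable.Range(0, segmentCount)
            .AsParallel()
            .SelectMany(n => ParseSegment(lines, n * segmentSize, segmentSize))
            .ToList();

        Console.WriteLine("Parsed {0} records.", records.Count);
    }
}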

Upvotes: 1

Mark Brackett

Reputation: 85665

I guess it depends on how "remote" it is. 100MB on a 100Mb LAN would be about 8 secs...up it to gigabit, and you'd have it in around 1 second. $50 * 2 for the cards, and $100 for a switch would be a very cheap upgrade you could do.

But, assuming it's further away than that, you should be able to open the file in read-only mode (after all, you're already reading it when you copy it). SMB/CIFS supports block reads, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).

Multithreading won't help, as you'll be disk or network bound anyway.

Upvotes: 1

CiNN

Reputation: 9880

I think using compression (deflate/gzip) would help.

Upvotes: 0

Chris Tybur

Reputation: 1622

I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.

Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.

Upvotes: 0

Luke

Reputation: 18973

The easiest way, considering you are already copying the file, would be to compress it before copying and decompress it once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also, your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
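
A minimal sketch of the compress/transfer/decompress round trip using the framework's built-in GZipStream (SharpZipLib, mentioned in another answer, offers equivalent streams); the paths are placeholders:

using System.IO;
using System.IO.Compression;

class GzipTransfer
{
    // Compress before copying (ideally on the remote side) ...
    static void Compress(string sourcePath, string gzPath)
    {
        using (var source = File.OpenRead(sourcePath))
        using (var target = File.Create(gzPath))
        using (var gzip = new GZipStream(target, CompressionMode.Compress))
        {
            source.CopyTo(gzip);
        }
    }

    // ... copy the much smaller .gz file, then decompress locally.
    static void Decompress(string gzPath, string targetPath)
    {
        using (var source = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(source, CompressionMode.Decompress))
        using (var target = File.Create(targetPath))
        {
            gzip.CopyTo(target);
        }
    }

    static void Main(string[] args)
    {
        Compress(args[0], args[0] + ".gz");
        // ... transfer the .gz file across the network here ...
        Decompress(args[0] + ".gz", args[0] + ".local");
    }
}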

The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.

Upvotes: 1

Wesley Tarle

Reputation: 668

If you are reading a sequential file, you want to read it line by line over the network. You need a transfer method capable of streaming. You'll need to review your IO streaming technology to figure this out.

Large IO operations like this won't benefit much from multithreading, since you can probably process the items as fast as you can read them over the network.

Your other great option is to put the log parser on the server, and download the results.

Upvotes: 2
