Reputation: 2509
Here is the situation:
I am making a small program to parse server log files.
I tested it with a log file containing several thousand requests (between 10,000 and 20,000, I don't know exactly).
What I have to do is load the log text files into memory so that I can query them.
This is taking the most resources.
The methods that take the most CPU time are these (worst culprits first):
String.Split - splits the line values into an array of values
String.Contains - checking if the user agent contains a specific agent string (to determine the browser ID)
String.ToLower - various purposes
StreamReader.ReadLine - to read the log file line by line
String.StartsWith - determine if a line is a column definition line or a line with values
There were some others that I was able to replace. For example, the dictionary getter was taking a lot of resources too, which I had not expected since it's a dictionary and should have its keys indexed. I replaced it with a multidimensional array and saved some CPU time.
Now I am running on a fast dual core, and the total time it takes to load the file I mentioned is about 1 second.
Now this is really bad.
Imagine a site that has tens of thousands of visits a day. It's going to take minutes to load the log file.
So what are my alternatives, if any? Because I think this is just a .NET limitation and I can't do much about it.
EDIT:
If some of you gurus want to look at the code and find the problem, here are my code files:
The function that takes the most resources is by far LogEntry.New. The function that loads all the data is called Data.Load.
Total number of LogEntry objects created: 50,000. Time taken: 0.9 - 1.0 seconds.
CPU: AMD Phenom II X2 545, 3 GHz.
Not multithreaded.
Upvotes: 2
Views: 1407
Reputation: 7475
You can do several things:
A Windows service which continuously parses the log each time it changes; your UI then queries this service (a sketch follows below).
Or you can parse it every minute or so and cache the result. Do you really need it to be in real time? Maybe it only needs to be parsed once?
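For the first option, a minimal sketch of the watcher part, assuming a hypothetical ParseLog function that wraps your existing parsing code and a cache the UI reads from instead of the file:

    Imports System.IO
    Imports System.Collections.Generic

    Module LogWatcherSketch
        ' Hypothetical cache the UI queries instead of re-reading the file.
        Private cachedEntries As List(Of LogEntry)
        Private watcher As FileSystemWatcher

        Sub StartWatching(logPath As String)
            watcher = New FileSystemWatcher(Path.GetDirectoryName(logPath), Path.GetFileName(logPath))
            watcher.NotifyFilter = NotifyFilters.LastWrite Or NotifyFilters.Size
            AddHandler watcher.Changed, AddressOf OnLogChanged
            watcher.EnableRaisingEvents = True
        End Sub

        Private Sub OnLogChanged(sender As Object, e As FileSystemEventArgs)
            ' Re-parse only when the file actually changes.
            cachedEntries = ParseLog(e.FullPath)
        End Sub

        Private Function ParseLog(path As String) As List(Of LogEntry)
            ' Placeholder: call the existing Data.Load / LogEntry parsing here.
            Return New List(Of LogEntry)()
        End Function
    End Module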
Upvotes: 0
Reputation: 161821
Have you considered loading log entries into a database and querying from there? This way, you'd be able to skip parsing log entries you've already stored in the database.
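Something roughly like this, assuming a LogEntries table of your own design (the table and column names here are made up); only lines added since the last import would need to be parsed and inserted:

    Imports System.Data.SqlClient

    Module LogDatabaseSketch
        Sub SaveEntry(connectionString As String, requestTime As DateTime, uri As String, userAgent As String)
            Using conn As New SqlConnection(connectionString)
                conn.Open()
                ' Insert one parsed line; querying then happens in SQL, not in memory.
                Using cmd As New SqlCommand("INSERT INTO LogEntries (RequestTime, Uri, UserAgent) VALUES (@time, @uri, @agent)", conn)
                    cmd.Parameters.AddWithValue("@time", requestTime)
                    cmd.Parameters.AddWithValue("@uri", uri)
                    cmd.Parameters.AddWithValue("@agent", userAgent)
                    cmd.ExecuteNonQuery()
                End Using
            End Using
        End Sub
    End Module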
Upvotes: 0
Reputation: 1503090
Without seeing your code, it's hard to know whether you've got any mistakes there which are costing you performance. Without seeing some sample data, we can't reasonably try experiments to see how we'd fare ourselves.
What was your dictionary key before? Moving to a multi-dimensional array sounds like an odd move - but we'd need more information to know what you were doing with the data before.
Note that unless you're explicitly parallelizing the work, having a dual core machine won't make any difference. If you're really CPU bound then you could parallelize - although you'd need to do so carefully; you would quite probably want to read a "chunk" of text (several lines) and ask one thread to parse it rather than handing off one line at a time. The resulting code would probably be significantly more complex though.
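Purely as an illustration of the chunked approach, using .NET 4's Parallel.ForEach; ParseLine below is a placeholder for whatever currently turns one line into a LogEntry:

    Imports System.IO
    Imports System.Collections.Generic
    Imports System.Collections.Concurrent
    Imports System.Threading.Tasks

    Module ParallelParseSketch
        Sub LoadParallel(path As String)
            ' Read sequentially, grouping lines into chunks of 1000.
            Dim chunks As New List(Of List(Of String))()
            Dim current As New List(Of String)(1000)
            Using reader As New StreamReader(path)
                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    current.Add(line)
                    If current.Count = 1000 Then
                        chunks.Add(current)
                        current = New List(Of String)(1000)
                    End If
                    line = reader.ReadLine()
                End While
            End Using
            If current.Count > 0 Then chunks.Add(current)

            ' Hand each chunk to a worker thread; ParseLine is a placeholder for the
            ' existing per-line parsing (what LogEntry.New does now).
            Dim entries As New ConcurrentBag(Of LogEntry)()
            Parallel.ForEach(chunks,
                Sub(chunk)
                    For Each l In chunk
                        entries.Add(ParseLine(l))
                    Next
                End Sub)
        End Sub
    End Module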
I don't know whether one second for 10,000 lines is reasonable or not, to be honest - if you could post some sample data and what you need to do with it, we could give more useful feedback.
EDIT: Okay, I've had a quick look at the code. A few thoughts...
Most importantly, this probably isn't something you should do "on demand". Instead, parse periodically as a background process (e.g. when logs roll over) and put the interesting information in a database - then query that database when you need to.
However, to optimise the parsing process:
Don't keep checking whether the StreamReader is at the end - just call ReadLine until the result is Nothing.
line.StartsWith("#") might be replaceable with a check on the first character directly - I'd have to test.
Create a LineFormat class which can cope with any field names, but specifically remembers the index of fields that you know you're going to want. This also avoids copying the complete list of fields for each log entry, which is pretty wasteful.
There are probably other things, but I'm afraid I don't have the time to go into them now :(
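One possible shape for that LineFormat idea - the field names and the two-argument LogEntry constructor are only illustrative:

    Public Class LineFormat
        ' Remembers only the indexes of the columns that will actually be queried.
        Private ReadOnly uriIndex As Integer
        Private ReadOnly userAgentIndex As Integer

        Public Sub New(fieldsHeader As String)
            ' e.g. "#Fields: date time cs-uri-stem cs(User-Agent) sc-status"
            Dim names = fieldsHeader.Substring("#Fields:".Length).Trim().Split(" "c)
            uriIndex = Array.IndexOf(names, "cs-uri-stem")
            userAgentIndex = Array.IndexOf(names, "cs(User-Agent)")
        End Sub

        Public Function Parse(line As String) As LogEntry
            Dim parts = line.Split(" "c)
            ' Keep just the interesting fields instead of copying the whole array into every entry.
            Return New LogEntry(parts(uriIndex), parts(userAgentIndex))
        End Function
    End Class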
Upvotes: 4
Reputation: 15867
You could try lazy loading: for example, read the file 4096 bytes at a time, look for line endings, and save all line-ending positions in an array. Now, if some part of your program wants LogEntry N, look up the start position of that line, read it, and create a LogEntry object on the fly. (This is a bit easier with memory-mapped files.) As possible optimizations, if the calling code usually needs consecutive LogEntries, your code could e.g. read ahead the next 100 log entries automatically. You could also cache the last 1000 entries that were accessed.
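A rough sketch of the offset index; the LineIndex/GetLine names are made up, and the LogEntry would then be built on demand from the returned line:

    Imports System.IO
    Imports System.Collections.Generic

    Public Class LineIndex
        Private ReadOnly offsets As New List(Of Long)()
        Private ReadOnly path As String

        Public Sub New(path As String)
            Me.path = path
            ' Scan the file once, 4096 bytes at a time, recording where each line starts.
            Dim buffer(4095) As Byte
            Dim position As Long = 0
            offsets.Add(0)
            Using fs As New FileStream(path, FileMode.Open, FileAccess.Read)
                Dim read As Integer = fs.Read(buffer, 0, buffer.Length)
                While read > 0
                    For i = 0 To read - 1
                        ' LF found: the next line starts right after it.
                        If buffer(i) = 10 Then offsets.Add(position + i + 1)
                    Next
                    position += read
                    read = fs.Read(buffer, 0, buffer.Length)
                End While
            End Using
        End Sub

        Public Function GetLine(n As Integer) As String
            ' offsets(n) is the byte position where line n starts.
            Using fs As New FileStream(path, FileMode.Open, FileAccess.Read)
                fs.Position = offsets(n)
                Using reader As New StreamReader(fs)
                    Return reader.ReadLine()
                End Using
            End Using
        End Function
    End Class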
Upvotes: 0
Reputation: 15139
Have you already looked at memory-mapped files? (That's in .NET 4.0, though.)
EDIT: Also, is it possible to split those large files into smaller ones and parse the smaller files? This is something we have done with some of our large files, and it was faster than parsing the giant files.
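A quick illustration of reading the log through a .NET 4.0 memory-mapped file; the per-line parsing itself would stay the same:

    Imports System.IO
    Imports System.IO.MemoryMappedFiles

    Module MemoryMappedSketch
        Sub ReadWithMapping(path As String)
            Using mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open)
                Using stream = mmf.CreateViewStream()
                    Using reader As New StreamReader(stream)
                        Dim line As String = reader.ReadLine()
                        While line IsNot Nothing
                            ' Parse the line here as usual.
                            line = reader.ReadLine()
                        End While
                    End Using
                End Using
            End Using
        End Sub
    End Module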
Upvotes: 2
Reputation: 227
You could try RegEx. Or change the business process so that loading the file at that speed is less of a problem.
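Purely as an illustration, a compiled Regex that pulls a few fields out of an Apache-style log line in one pass (your log format may well differ):

    Imports System.Text.RegularExpressions

    Module RegexSketch
        ' Example pattern for a common/combined log line; adjust to the actual format.
        Private ReadOnly LinePattern As New Regex( _
            "^(?<ip>\S+) \S+ \S+ \[[^\]]+\] ""(?<method>\S+) (?<url>\S+) [^""]*"" (?<status>\d+)", _
            RegexOptions.Compiled)

        Sub ParseWithRegex(line As String)
            Dim m = LinePattern.Match(line)
            If m.Success Then
                ' Use the named groups instead of separate Split/Contains calls.
                Dim url = m.Groups("url").Value
                Dim status = Integer.Parse(m.Groups("status").Value)
            End If
        End Sub
    End Module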
Upvotes: 1