Reputation:
I am writing a program that needs to search a LARGE text document for a large collection of words. The words are all file names with underscores in them (e.g., this_file_name). I know how to open and iterate through a text document, but I'm curious whether I should use a regex to search for these names, and if so, what kind of pattern I should use. I've tried:
Regex r = new Regex("?this\_file\_name");
but I get an invalid argument error every time.
Upvotes: 2
Views: 6101
Reputation: 8763
If I understand your problem correctly, I think a regular expression is the wrong tool for the job. I'll assume your file names are separated with some kind of delimiter (like commas or new lines).
If this is the case, use String.Split to put all file names into an array, sort the array alphabetically, then perform a binary search against the sorted array for each item in the "collection" you mentioned. I'm pretty sure that this is the most computationally efficient way to perform the task.
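The split, sort, and binary-search steps could look roughly like the sketch below. The class name `SortedLookup`, the method names, and the delimiter set are assumptions made for illustration; adjust the delimiters to whatever actually separates the names in your document.

```csharp
using System;

public static class SortedLookup
{
    // Splits the document text into tokens and sorts them so that
    // Array.BinarySearch can be used afterwards.
    // The delimiter list is an assumption, not something the question specifies.
    public static string[] BuildIndex(string text)
    {
        char[] delimiters = { ' ', ',', ';', '\t', '\r', '\n' };
        string[] tokens = text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

        // Sort with the same comparer that the search uses below.
        Array.Sort(tokens, StringComparer.Ordinal);
        return tokens;
    }

    // Returns true if fileName appears as a whole token in the sorted array.
    public static bool Contains(string[] sortedTokens, string fileName)
    {
        return Array.BinarySearch(sortedTokens, fileName, StringComparer.Ordinal) >= 0;
    }
}
```

Build the index once with `SortedLookup.BuildIndex(text)`, then call `SortedLookup.Contains(index, name)` for each name in the collection. The sort and the search must use the same comparer, otherwise the binary search can miss entries.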
When you say "LARGE" text files, think about their size relative to the machines this program will be running on. A 1 MB text file may seem large, but it will easily fit into the memory of a machine with 2 GB RAM. If the file is considerably larger compared to the memory of your client machines, read the file in chunks rather than all at once. This is called buffering.
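For a file that does not fit comfortably in memory, one way to process it in chunks is line by line. The sketch below swaps the sorted array for a `HashSet<string>` of the wanted names, because the document is no longer held in memory to sort. `StreamingSearch`, `FindNames`, and the delimiter set are hypothetical names and choices, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class StreamingSearch
{
    // Scans the lines one at a time and returns the wanted names that occur
    // as whole tokens. Only the current line is held in memory.
    public static HashSet<string> FindNames(IEnumerable<string> lines, IEnumerable<string> wanted)
    {
        var wantedSet = new HashSet<string>(wanted, StringComparer.Ordinal);
        var found = new HashSet<string>(StringComparer.Ordinal);
        char[] delimiters = { ' ', ',', ';', '\t' }; // assumed delimiters

        foreach (string line in lines)
        {
            foreach (string token in line.Split(delimiters, StringSplitOptions.RemoveEmptyEntries))
            {
                if (wantedSet.Contains(token))
                {
                    found.Add(token);
                }
            }
        }
        return found;
    }
}
```

Passing `File.ReadLines(path)` from `System.IO` as `lines` reads the file lazily, so the whole document is never loaded at once; `path` here stands for wherever your document lives.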
Upvotes: 0
Reputation: 556
Perhaps break your document into tokens by splitting on spaces or non-word characters first?
After that, I think a regex that might work for you would look something like this:
Regex r = new Regex(@"([\w_]+)");
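As a rough sketch of how such a tokenizing pattern could be used, the code below pulls every word-like token out of the text and counts how often each wanted name appears. It uses `\w+`, since `\w` already matches the underscore. `TokenSearch` and `CountOccurrences` are hypothetical names, and the sketch assumes the file names contain only word characters (no dots or hyphens).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenSearch
{
    // \w matches letters, digits, and the underscore, so names like
    // this_file_name come out as a single token.
    private static readonly Regex WordPattern = new Regex(@"\w+");

    // Returns how many times each wanted name occurs as a whole token.
    public static Dictionary<string, int> CountOccurrences(string text, IEnumerable<string> wanted)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (string name in wanted)
        {
            counts[name] = 0;
        }

        foreach (Match m in WordPattern.Matches(text))
        {
            if (counts.ContainsKey(m.Value))
            {
                counts[m.Value]++;
            }
        }
        return counts;
    }
}
```

The regex does the tokenizing and the dictionary does the lookup, so the number of names being searched for has little effect on the scan itself.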
Upvotes: 1
Reputation: 40235
It would be helpful to see a sample of the source text, but maybe this helps:
var doc = @"asdfsdafjkj;lkjsadf asddf jsadf asdfj;lksdajf
sdafjkl;sjdfaas sadfj;lksadf sadf jsdaf jf sda sdaf asdf sad
jasfd sdf sadf sadf sdajlk;asdf
this_file_name asdfsadf asdf asdf asdf
asdf sadf asdfj asdf sdaf sadfsadf
sadf asdf this_file_name asdf asdf ";
var reg = new Regex("this_file_name", RegexOptions.IgnoreCase | RegexOptions.Multiline);
var matches = reg.Matches(doc);
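To search for a whole collection of names in one pass, the same idea can be extended by joining the names into a single alternation. The sketch below is one possible way to do that; `MultiNameRegex` and `Build` are hypothetical names. Each name is passed through `Regex.Escape` so any special characters are treated literally. A pattern that starts with a bare `?`, like the one in the question, is rejected by .NET because the quantifier has nothing to repeat, which is a likely cause of the error reported there.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MultiNameRegex
{
    // Builds one regex that matches any of the given names as a whole word.
    // Assumes fileNames is non-empty; an empty list would produce a pattern
    // that matches empty strings at word boundaries.
    public static Regex Build(IEnumerable<string> fileNames)
    {
        string alternation = string.Join("|", fileNames.Select(n => Regex.Escape(n)));

        // \b on both sides keeps this_file_name from matching inside
        // a longer token such as this_file_name2.
        return new Regex(@"\b(?:" + alternation + @")\b", RegexOptions.IgnoreCase);
    }
}
```

Calling `MultiNameRegex.Build(names).Matches(doc)` then returns every occurrence of any name, and `Match.Value` and `Match.Index` tell you which name matched and where. For a very large collection of names, a set lookup on tokens may scale better than one large alternation.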
Upvotes: 2