Reputation: 63250
I have a regex for splitting a line of an FTP directory listing from a Windows server. It splits the string in one case but not in the other. I'm no regex expert; can someone tell me why one of these lines is split and the other isn't?
I would like it to split the string so I have the following components:
DateTime
IsDirectory/IsFile (<DIR> is present or not)
Size
FileName
Line (1) is not split; line (2) is:
//05-14-14 11:29AM 0 New Text Document.txt (1)
//05-12-14 12:17PM <DIR> TONY (2)
string directorylisting = "05-14-14 11:29AM 0 New Text Document.txt";
string regex = @"^(\d\d-\d\d-\d\d)\s+(\d\d:\d\d(AM|PM))\s+(<DIR>)?\s+(\d*)\s+([\w\._\-]+)\s*$";
var split = Regex.Split(directorylisting, regex);
Upvotes: 0
Views: 89
Reputation: 89584
I'm not sure the Split method is the right tool here. I suggest using the Matches method with named captures instead, and passing the whole directory listing as the input string:
string pattern = @"(?mx)^
(?<date> [0-9]{2}(?:-[0-9]{2}){2} ) [ \t]+
(?<time> [0-9]{2}:[0-9]{2}[AP]M ) [ \t]+
(?:
(?<isDir> <DIR> )
|
(?<filesize> [0-9]+ )
) [ \t]+
(?(isDir)
(?<dirname> [^<>*|"":/\\?\u0001-\u001f\n\r]{1,32768}? )
|
(?<filename> [^<>*|"":/\\?\u0001-\u001f\n\r]{1,32768}? )
) [^\S\n]* $";
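// listing is assumed to hold the whole directory listing, one entry per line
foreach (Match m in Regex.Matches(listing, pattern)) {
    // the isDir group succeeds only on directory lines
    bool isDir = m.Groups["isDir"].Success;
    string name = isDir ? m.Groups["dirname"].Value : m.Groups["filename"].Value;
    // filesize is only captured for files
    string size = isDir ? "" : m.Groups["filesize"].Value;
}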
(Note: I have tried to follow Microsoft's rules for file and directory names, but I'm not 100% sure I got them right; feel free to improve these character classes.)
If you need to ensure that all the lines are contiguous (as is the case when you use the Split method), you can add \G at the beginning of the pattern and \n? at the end (after the dollar sign).
The last character class, [^\S\n]*, could probably be replaced with \r? (I can't test this, as I don't use Windows), and [ \t] with a single space [ ] or with \t (I'll let you test it).
Upvotes: 1
Reputation: 3188
The correct regex for this is
(\d\d-\d\d-\d\d)\s+(\d\d:\d\d(AM|PM))\s+(<DIR>)?\s+(\d*)\s+([\w\._\-]+\s)*
You have to capture \s in the last part to avoid splitting your string.
Tested on RegexHero. I don't think you need ^ and $ in this specific example.
Upvotes: 0
Reputation: 41838
The problem seems to be at the very end: \s*$
The early part of the regex, i.e.
^(\d\d-\d\d-\d\d)\s+(\d\d:\d\d(AM|PM))\s+(<DIR>)?\s+(\d*)\s+([\w\._\-]+)
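matches each line up to "New" and "TONY" respectively.
In line (1) there is more text after that point (" Text Document.txt"), and the trailing \s*$ cannot match it, because it only allows whitespace up to the end of the line. The character class [\w\._\-] does not include a space, so a file name that contains spaces is never matched in full.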
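For what it's worth, here is a minimal sketch of one way around this, assuming each listing line is parsed on its own with Regex.Match rather than Regex.Split. The class name FtpListingParser, the Parse method and the group names are made up for illustration; none of them come from the question. The name group uses a lazy .+? so that spaces inside the file name are allowed, and the trailing \s*$ still absorbs padding or a stray \r.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class FtpListingParser
{
    // Assumed pattern (not from the original posts). Either <DIR> or a size
    // is expected in the third column, and the name may contain spaces.
    private static readonly Regex LinePattern = new Regex(
        @"^(?<date>[0-9]{2}-[0-9]{2}-[0-9]{2})\s+" +
        @"(?<time>[0-9]{2}:[0-9]{2}(?:AM|PM))\s+" +
        @"(?:(?<dir><DIR>)|(?<size>[0-9]+))\s+" +
        @"(?<name>.+?)\s*$");

    // Returns null when the line does not look like a listing entry.
    public static (string Date, string Time, bool IsDirectory, long Size, string Name)? Parse(string line)
    {
        Match m = LinePattern.Match(line);
        if (!m.Success)
        {
            return null;
        }

        bool isDirectory = m.Groups["dir"].Success;
        // Directory lines carry no size column, so report 0 for them.
        long size = isDirectory
            ? 0
            : long.Parse(m.Groups["size"].Value, CultureInfo.InvariantCulture);

        return (m.Groups["date"].Value, m.Groups["time"].Value,
                isDirectory, size, m.Groups["name"].Value);
    }
}
```

With this sketch, FtpListingParser.Parse("05-14-14 11:29AM 0 New Text Document.txt") gives a Name of "New Text Document.txt" and a Size of 0, while the "<DIR> TONY" line comes back with IsDirectory set to true. The date and time are left as strings here; converting them to a DateTime is a separate step.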
Upvotes: 1