Tony The Lion

Reputation: 63250

Regex splits one string, but not the other

I have a regex for splitting an FTP directory listing from a Windows Server, and it will split the string in one case, and not the other. I'm no regex expert, and wondered if someone could tell me why one of these will be split, and the other won't?

I would like it to split the string so I have the following components:

 DateTime
 IsDirectory/IsFile  (<DIR> is present or not)
 Size
 FileName

Line (1) is not split; line (2) is:

//05-14-14  11:29AM                    0 New Text Document.txt (1)
//05-12-14  12:17PM       <DIR>          TONY (2)

string directorylisting = "05-14-14  11:29AM                    0 New Text Document.txt";
string regex = @"^(\d\d-\d\d-\d\d)\s+(\d\d:\d\d(AM|PM))\s+(<DIR>)?\s+(\d*)\s+([\w\._\-]+)\s*$";
var split = Regex.Split(directorylisting, regex);

Upvotes: 0

Views: 89

Answers (3)

Casimir et Hippolyte

Reputation: 89584

I'm not sure the Split method is the right tool here. I suggest using the Matches method with named captures instead, with the whole directory listing as the input string:

string pattern = @"(?mx)^
    (?<date> [0-9]{2}(?:-[0-9]{2}){2} ) [ \t]+
    (?<time> [0-9]{2}:[0-9]{2}[AP]M   ) [ \t]+ 
    (?:
        (?<isDir>    <DIR>  )
      |
        (?<filesize> [0-9]+ )
    ) [ \t]+
    (?(isDir)
        (?<dirname>  [^<>*|"":/\\?\u0001-\u001f\n\r]{1,32768}? )
      |
        (?<filename> [^<>*|"":/\\?\u0001-\u001f\n\r]{1,32768}? )
    ) [^\S\n]* $";

foreach (Match m in Regex.Matches(listing, pattern)) {
    // for each line you can test the group isDir to know if it is 
    // a directory or not
}

(Note: I have tried to understand Microsoft rules for filename/dirname but I'm not 100% sure, feel free to improve these character classes)

If you need to ensure that all the lines are contiguous (as is the case when you use the split method), you can add \G at the beginning of the pattern and \n? at the end (after the dollar).

The last character class [^\S\n]* could probably be replaced with \r? (I can't test this, as I don't use Windows), and [ \t] with [ ] or \t (I'll leave that for you to test).

Upvotes: 1

Kilazur

Reputation: 3188

The correct regex for this is

(\d\d-\d\d-\d\d)\s+(\d\d:\d\d(AM|PM))\s+(<DIR>)?\s+(\d*)\s+([\w\._\-]+\s)*

You have to capture \s in the last part to avoid splitting your string.

Tested on RegexHero. I don't think you need ^ and $ in this specific example.

Upvotes: 0

zx81

Reputation: 41838

The problem seems to be at the very end: \s*$

The early part of the regex, i.e.

^(\d\d-\d\d-\d\d)\s+(\d\d:\d\d(AM|PM))\s+(<DIR>)?\s+(\d*)\s+([\w\._\-]+)

matches each line up to "New" and "TONY" respectively.


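But in line (1) there is more text after that (" Text Document.txt"), and \s*$ cannot match it, because it only allows whitespace up to the end of the line. The character class [\w\._\-] does not include a space, so a file name that contains spaces cannot be captured in full.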
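To illustrate the point, here is a minimal sketch that runs the question's pattern and a revised one against both sample lines. The revised pattern is only my assumption of one possible fix, not a tested parser for every FTP listing format. It lets the name group contain spaces and makes <DIR> and the size mutually exclusive. The class name FtpListingDemo and the group names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FtpListingDemo
{
    // The pattern from the question, unchanged.
    public const string Original =
        @"^(\d\d-\d\d-\d\d)\s+(\d\d:\d\d(AM|PM))\s+(<DIR>)?\s+(\d*)\s+([\w\._\-]+)\s*$";

    // Hypothetical revision (an assumption, not the asker's code):
    // either <DIR> or a size must be present, and the name may contain spaces.
    // The lazy .+? leaves any trailing whitespace to \s*.
    public const string Revised =
        @"^(?<date>\d\d-\d\d-\d\d)\s+(?<time>\d\d:\d\d(?:AM|PM))\s+" +
        @"(?:(?<dir><DIR>)|(?<size>\d+))\s+(?<name>.+?)\s*$";

    public static void Main()
    {
        string[] lines =
        {
            "05-14-14  11:29AM                    0 New Text Document.txt",
            "05-12-14  12:17PM       <DIR>          TONY"
        };

        foreach (string line in lines)
        {
            // Does the question's pattern match the whole line?
            Console.WriteLine("original matches: {0}", Regex.IsMatch(line, Original));

            Match m = Regex.Match(line, Revised);
            if (!m.Success)
            {
                Console.WriteLine("revised pattern did not match");
                continue;
            }

            // The "dir" group only participates when <DIR> was present.
            bool isDirectory = m.Groups["dir"].Success;
            Console.WriteLine("date={0} time={1} isDir={2} size='{3}' name='{4}'",
                m.Groups["date"].Value, m.Groups["time"].Value,
                isDirectory, m.Groups["size"].Value, m.Groups["name"].Value);
        }
    }
}
```

Using Regex.Match with named groups instead of Regex.Split also avoids having to pick the fields out of the split array by position.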

Upvotes: 1
