Abu Dun

Reputation: 364

"Pattern matching" and extracting in C

I need to parse a lot of filenames (up to 250000 I guess), including the path, and extract some parts out of it.

Here is an example:

Original: /my/complete/path/to/80/01/a9/1d.pdf

Needed: 8001a91d

The "pattern" I am looking for will always begin with "/8". The parts I need to extract form a string of 8 hex digits.

My idea is the following (simplified for demonstration):

/* original argument */
char *path = "/my/complete/path/to/80/01/a9/1d.pdf";

/* pointer to substring */
char *begin = NULL;

/* final char array to be built */
char *hex = (char*)malloc(9);

/* find "pattern" */
begin = strstr(path, "/8");
if(begin == NULL)
    return 1;

/* jump to first needed character */
begin++;

/* copy the needed characters to target char array */
strncpy(hex,   begin,   2);
strncpy(hex+2, begin+3, 2);
strncpy(hex+4, begin+6, 2);
strncpy(hex+6, begin+9, 2);
strncpy(hex+8, "\0",    1);     

/* print final char array */
printf("%s\n", hex);

This works, but I have the feeling it is not the cleverest way, and that there might be traps I don't see.

So, can anyone suggest what could be dangerous about this pointer-shifting approach? What would you improve?

Does C provide a way to do something like s|/(8.)/(..)/(..)/(..)\.|\1\2\3\4| ? If I remember correctly, some scripting languages have a feature like that.

Upvotes: 1

Views: 1122

Answers (3)

hroptatyr

Reputation: 4809

In the simple case of just matching /8./../../.. I'd personally go for the strstr() solution (no external dependency required). If the rules become more complex though, you could try a lexer (flex and friends); they support regular expressions.

In your case something like this:

h2           [0-9A-Fa-f]{2}
mymatch      ("/"{h2}){4}

could work. You'd have to set buffers to the match by side effect though as lexers typically return token identifiers.

Anyway, you'd gain the power of regexps without the dependencies but at the expense of generated (read: unreadable) code.

Upvotes: 0

wildplasser

Reputation: 44240

/* original argument */
char *path = "/my/complete/path/to/80/01/a9/1d.pdf";
char *begin;
char hex[9];
size_t len;

/* find "pattern" */
begin = strstr(path, "/8");
if (!begin) return 1;

/* sanity check */
len = strlen(begin);
if (len < 12) return 2;

/* more sanity */
if (begin[3] != '/' || begin[6] != '/' || begin[9] != '/') return 3;

memcpy(hex,   begin+1,  2);
memcpy(hex+2, begin+4,  2);
memcpy(hex+4, begin+7,  2);
memcpy(hex+6, begin+10, 2);
hex[8] = 0;

/* For additional sanity, you could check for valid hex characters here */

/* print final char array */
printf("%s\n", hex);
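
The hex-character check mentioned in the comment could look roughly like the sketch below. It wraps the same steps in a function and uses isxdigit() from <ctype.h>. The function name extract_hex, the pair_at offset table and the return code 4 are made up for illustration and are not part of the snippet above.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (name and return codes are assumptions):
   writes 8 hex digits plus a terminating '\0' into out.
   Returns 0 on success, 1..4 depending on which check failed. */
int extract_hex(const char *path, char out[9])
{
    /* offsets of the four digit pairs, counted from the leading '/' */
    static const int pair_at[4] = { 1, 4, 7, 10 };
    const char *begin = strstr(path, "/8");
    int i;

    if (!begin) return 1;                 /* no "/8" anywhere in the path */
    if (strlen(begin) < 12) return 2;     /* too short for "/xx/xx/xx/xx" */
    if (begin[3] != '/' || begin[6] != '/' || begin[9] != '/') return 3;

    for (i = 0; i < 4; i++) {
        const char *p = begin + pair_at[i];
        /* the cast avoids undefined behaviour for negative char values */
        if (!isxdigit((unsigned char)p[0]) || !isxdigit((unsigned char)p[1]))
            return 4;
        memcpy(out + 2 * i, p, 2);
    }
    out[8] = '\0';
    return 0;
}
```

Note that, like the original, this only looks at the first "/8" in the path, so a directory such as /8ball earlier in the path would make it fail rather than search further.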

Upvotes: 0

luser droog

Reputation: 19494

C itself doesn't provide this, but you can use POSIX regex (regcomp() and regexec() from <regex.h>). It's a full-featured regular expression library. But for a pattern as simple as yours, the strstr() approach you already have is probably the best way.
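
For comparison, here is a minimal sketch of what the POSIX regex route might look like. The function name extract_hex_regex and its return codes are invented for this example. The pattern is a stricter variant of the one in the question, accepting only hex digits instead of any character.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper (name and return codes are assumptions):
   returns 0 and fills out with 8 hex digits on a match,
   1 if the path does not match, -1 if the pattern fails to compile. */
int extract_hex_regex(const char *path, char out[9])
{
    regex_t re;
    regmatch_t m[5];   /* m[0] is the whole match, m[1]..m[4] the groups */
    int i, rc;

    if (regcomp(&re,
                "/(8[0-9A-Fa-f])/([0-9A-Fa-f]{2})"
                "/([0-9A-Fa-f]{2})/([0-9A-Fa-f]{2})\\.",
                REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, path, 5, m, 0);
    regfree(&re);
    if (rc != 0) return 1;

    /* every group is exactly two characters long */
    for (i = 0; i < 4; i++)
        memcpy(out + 2 * i, path + m[i + 1].rm_so, 2);
    out[8] = '\0';
    return 0;
}
```

The sketch compiles the pattern on every call to keep it short. With something like 250000 filenames you would probably want to call regcomp() once up front and reuse the compiled regex_t.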

BTW, prefer memcpy to strncpy. Very few people know what strncpy is good for. And I'm not one of them.
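
To make the strncpy remark a bit more concrete, here is a small sketch of its two less obvious behaviours: it writes no terminating '\0' when the source is at least n characters long, and it pads with '\0' up to n when the source is shorter. The function names are made up for the demonstration, and some compilers may warn about the deliberate truncation.

```c
#include <string.h>

/* Copy exactly two characters into a buffer pre-filled with 'X'
   and report the byte right after them. */
char byte_after_strncpy(void)
{
    char buf[4] = { 'X', 'X', 'X', 'X' };
    strncpy(buf, "80/01", 2);  /* source longer than n: no '\0' is written */
    return buf[2];             /* still the old 'X' */
}

/* When the source is shorter than n, the rest is filled with '\0'.
   Returns 1 if all three remaining bytes were zeroed. */
int strncpy_pads_with_nul(void)
{
    char buf[4] = { 'X', 'X', 'X', 'X' };
    strncpy(buf, "8", 4);
    return buf[1] == '\0' && buf[2] == '\0' && buf[3] == '\0';
}
```

Copying a fixed number of bytes that you terminate yourself is exactly what memcpy does, without the padding logic.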

Upvotes: 2
