Precious

Reputation: 237

extracting first letter of a String with Regex

I am new to regular expressions and I want to search and replace particular text in my text file. I managed most of the searches, but there is one I can't quite get the hang of. I think I should be using lookaround (lookahead/lookbehind), but the tool I'm using reports a syntax error. Here is the data in my file:

[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5

[2010-01-15 06:18:11.203] [0x00001388] [SHDNT] Shutdown Count Down = 3/5

I want my search to capture the '[' and ']' around the date. I thought of finding the '[' with a criterion like "'[' followed by [0-9][0-9]" (two digits), and the ']' with "']' preceded by \.[0-9][0-9][0-9]" (a dot and three digits).

I tried \[(?=[0-9][0-9]) for the first search, but it gives an error. The tool doesn't allow me to put ? right after the parenthesis.

How should i do the search?

Thanks in advance

EDITED TO ADD

To make it clear, I am not using regex from a programming language. I am using a text editor whose search-and-replace function supports pattern searches. I want to remove the square brackets around the date without changing anything else in my file.

Upvotes: 2

Views: 13770

Answers (7)

Owen S.

Reputation: 7855

The following regular expression:

^\[([^\]]+)\]

will capture the date at the beginning of the string plus square brackets, and will put the stuff in between the square brackets into a group that can be extracted by itself.

Note that your text editor may have a slightly different syntax. Here's how this breaks down:

^ = beginning of line/string
\[, \] = literal [ and ] characters
() = signifies a group to capture
[^\]] = matches any character _except_ a close bracket
        (this keeps the match from being too greedy)
+ = one or more of the previous

EDIT: This assumes your regex facility supports groups (which most do). The easiest way to explain groups is just to show you how they work with one such engine. In the Python interpreter:

>>> import re
>>> s = '[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] ...'
>>> r = re.compile(r'^\[([^\]]+)\]')
>>> m = r.search(s)

This creates a regular expression object and searches the string for the first set of text that matches it. The result is returned in a match object:

>>> m
<_sre.SRE_Match object at 0x1004d9558>

To get the entire set of text that was matched, the Python convention is to invoke group() on the match object:

>>> m.group()
'[2010-01-15 06:18:10.203]'

and to get just the stuff in parentheses, I pass the number of the group I want (in this case there's just one set of parens, so just one group):

>>> m.group(1)
'2010-01-15 06:18:10.203'

If I perform a replace instead of a search, I use the sub method. sub takes the string that should replace the full match, followed by the input string, and returns the input with the replacement performed (or unchanged if nothing matched):

>>> r.sub('spam spam spam', s)
'spam spam spam [0x00001388] [SHDNT] ...'

However, the replacement string supports escape sequences that refer to specific values of groups captured by the match. A group substitution is indicated by \N, where N is the number of the group. Hence:

>>> r.sub(r' \1 ', s)
' 2010-01-15 06:18:10.203  [0x00001388] [SHDNT] ...'

which is what you want.
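
As a rough sketch of how the same pattern could be applied to a whole file's contents at once, assuming Python's re module as in the session above: re.MULTILINE makes ^ match at the start of every line, and a bare \1 as the replacement drops the brackets without adding padding spaces. The sample text and the name strip_date_brackets are made up for illustration.

```python
import re

# Same pattern as above; MULTILINE lets ^ match at the start of each line.
DATE_BRACKETS = re.compile(r'^\[([^\]]+)\]', re.MULTILINE)

def strip_date_brackets(text):
    """Replace the leading [ ... ] on every line with its contents (group 1)."""
    return DATE_BRACKETS.sub(r'\1', text)

# Hypothetical sample data, copied from the question.
sample = (
    "[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5\n"
    "[2010-01-15 06:18:11.203] [0x00001388] [SHDNT] Shutdown Count Down = 3/5\n"
)
print(strip_date_brackets(sample))
```

Many editors accept the same pattern with \1 or $1 in the replace field, though the exact syntax depends on the editor.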

Upvotes: 3

ghostdog74

Reputation: 342313

Keep it simple; there's no need to use a regular expression. If the date/time part is all you want, use fields and field delimiters. Here's an awk expression that prints just the first column, using the closing square bracket as the field delimiter:

$ cat file
[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5
[2010-01-15 06:18:11.203] [0x00001388] [SHDNT] Shutdown Count Down = 3/5

$ awk -F"]" '{print $1"]"}' file
[2010-01-15 06:18:10.203]
[2010-01-15 06:18:11.203]

Or just print fields 1 and 2, using the default whitespace delimiters:

$ awk '{print $1,$2}' file
[2010-01-15 06:18:10.203]
[2010-01-15 06:18:11.203]

Update: To remove the square brackets, simply use gsub() or sub() on fields 1 and 2:

$ awk '{gsub(/^\[/,"",$1);gsub(/\]$/,"",$2)}1' file
2010-01-15 06:18:10.203 [0x00001388] [SHDNT] Shutdown Count Down = 2/5
2010-01-15 06:18:11.203 [0x00001388] [SHDNT] Shutdown Count Down = 3/5
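
For readers without awk, the same no-regex idea can be sketched in Python with plain string methods. This is only an illustration under the assumption that the date is always the first bracketed field on the line; strip_first_brackets is a hypothetical helper name.

```python
def strip_first_brackets(line):
    """Drop the leading '[' and the first ']' without using a regex."""
    if not line.startswith('['):
        return line  # leave lines that do not start with a bracket alone
    # Split the rest of the line at the first ']'.
    head, sep, tail = line[1:].partition(']')
    # If there was no ']' at all, sep is empty and the line is returned unchanged.
    return head + tail if sep else line

line = "[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5"
print(strip_first_brackets(line))
```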

Upvotes: 2

sarnold

Reputation: 104040

Ah, thanks for your additional comment in one of the answers.

In vim, I'd probably use the visual selection tool: put the cursor on the first [, type ^V, G (to get to the end of the file), then x to delete the column. Then repeat with the first ] character, ^V, G (but G will put the cursor on the wrong character -- so use l or the right-arrow-key to move over to the ]) and then type x to delete the column.

If it didn't line up perfectly in columns (perhaps the .203 could be fewer characters, say .2) then I would do this:

:%s/^\[//
:%s/\(\d\)] /\1 /

Note that the second regex is much more brittle: it deletes the first ] that sits between a digit and a space on every line. Most non-vim regex flavors aren't as annoying about escaping ( and ).

Of course, if you're not using a vi-clone, hopefully this can translate well enough. :)
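
As one possible translation, here is a sketch of the two substitutions in Python's re syntax (where capturing parentheses are not escaped). The function name is hypothetical, and count=1 is an assumption that mimics vim's default of replacing only the first match on each line.

```python
import re

def strip_brackets_two_step(line):
    # Step 1: drop a '[' at the very start of the line.
    line = re.sub(r'^\[', '', line, count=1)
    # Step 2: drop the first ']' that sits between a digit and a space,
    # keeping the digit via group 1.
    return re.sub(r'(\d)\] ', r'\1 ', line, count=1)

# Works even when the fractional seconds are shorter than three digits.
print(strip_brackets_two_step("[2010-01-15 06:18:10.2] [0x00001388] [SHDNT] x"))
```

The brittleness shows up on lines that do not begin with a date: the second step still removes the first ] it finds between a digit and a space, wherever that is.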

Upvotes: 0

p.campbell

Reputation: 100567

I'm not entirely sure you need a regular expression here if it's just a matter of finding the first character or extracting the text within the square brackets. Perhaps I've misunderstood your question?

C# example:

LINQ:

// Needs System.IO and System.Linq; path is a placeholder for your file's path.
char[] firsts = File.ReadAllLines(path).Select(f => f[0]).ToArray();

Looping with foreach:

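string[] allLines = File.ReadAllLines(path);   // path: placeholder for your file's path
foreach (string line in allLines)
{
    if (line.Length == 0) continue;            // skip empty lines; line[0] would throw
    char firstChar = line[0];
    Console.WriteLine("First char: " + firstChar);

    if (firstChar == '[')                      // == compares; = would assign
    {
        int closing = line.IndexOf(']');
        if (closing > 0)                       // -1 means no closing bracket was found
        {
            // Text between '[' (index 0) and the first ']'
            string textWithin = line.Substring(1, closing - 1);
            Console.WriteLine("Found this text within the square brackets: " + textWithin);
        }
    }
}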

Upvotes: 0

kerkeslager

Reputation: 1396

I agree with ghostdog that you should keep it simple, but you can keep it simple with regular expressions too:

  1. ^ matches the beginning of a line.
  2. . matches any single character.
  3. *? matches the previous thing zero or more times NON-GREEDILY, meaning it doesn't take more than it has to to make the rest of the regex match.

Put this together and you get ^.*?\] which matches from the beginning of the line to the first ] that it sees.

EDIT: Just saw your reply to ghostdog, which clarified the problem. It's still easier to match the entire date together with the brackets. Once you have that, just replace the entire match with itself, minus the first and last character. I don't know what language you're using, but in Python it would be something like this:

new_string = re.sub(r'^.*?\]', lambda m: m.group()[1:-1], original_string)
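
One caveat worth sketching: ^.*?\] also matches lines that do not start with a bracket, in which case the slice would eat an unrelated first character. A slightly stricter variant (an assumption for illustration, not part of the pattern above) requires a literal [ at the start of the line. It relies on Python's re.sub(pattern, repl, string) argument order; strip_outer is a hypothetical name.

```python
import re

def strip_outer(line):
    """Replace a leading [ ... ] with its contents; leave other lines alone."""
    # ^\[ requires the line to start with '['; .*? then stops at the first ']'.
    return re.sub(r'^\[.*?\]', lambda m: m.group()[1:-1], line)

print(strip_outer("[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5"))
print(strip_outer("note: see [1] below"))  # does not start with '[', so it is unchanged
```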

Upvotes: 0

msw

Reputation: 43487

Because your input format is so rigid, take the really simple way:

$ cut -c 2-24 <<EOF
[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5
[2010-01-15 06:18:11.203] [0x00001388] [SHDNT] Shutdown Count Down = 3/5
EOF

2010-01-15 06:18:10.203
2010-01-15 06:18:11.203
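
Note that cut keeps only the timestamp. If the rest of the line should survive, the same fixed-width idea can be sketched with Python slicing. This assumes every timestamp is exactly 23 characters wide, as in the sample data; strip_fixed_width is a hypothetical name.

```python
def strip_fixed_width(line):
    """Remove the brackets at fixed positions 0 and 24, keeping everything else."""
    # Only touch lines that really have '[' and ']' where the format says they are.
    if len(line) > 24 and line[0] == '[' and line[24] == ']':
        # line[1:24] is the timestamp (cut's columns 2-24);
        # line[25:] is everything after the closing bracket.
        return line[1:24] + line[25:]
    return line

line = "[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5"
print(strip_fixed_width(line))
```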

Upvotes: 0

sarnold

Reputation: 104040

I'm not sure you need to use the lookahead or lookbehind assertions in your regexp:

 sarnold@haig:/tmp$ cat date.pl
 #!/usr/bin/perl -w

 while(<>) {
     /^(\[\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d\])/;
     print "$1\n";
 }
 sarnold@haig:/tmp$ cat data
 [2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5
 [2010-01-15 06:18:11.203] [0x00001388] [SHDNT] Shutdown Count Down = 3/5
 sarnold@haig:/tmp$ ./date.pl data
 [2010-01-15 06:18:10.203]
 [2010-01-15 06:18:11.203]

I couldn't tell from your description if you do want the [ and ] around your date, or if you don't want them. If you don't want the square brackets, move them outside the parens:

     /^\[(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d)\]/;

sarnold@haig:/tmp$ ./date.pl data
2010-01-15 06:18:10.203
2010-01-15 06:18:11.203

Note that I've also anchored the regexp at the beginning of the line, in case the output includes a bracketed date-time somewhere else. Also, I over-specified the date-time compared to your example. Consider it paranoia. If you wanted to replace \d\d\d\d with \d{4} you could, but in this example I find the longer form more readable.
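
For comparison, here is a sketch in Python of both the over-specified pattern written with {n} counts and the lookaround idea from the question. It assumes an engine that supports lookahead and fixed-width lookbehind, which Python's re module does; the editor in the question apparently does not. The constant names are made up for illustration.

```python
import re

line = "[2010-01-15 06:18:10.203] [0x00001388] [SHDNT] Shutdown Count Down = 2/5"

# Over-specified timestamp, written with {n} counts instead of repeated \d.
STRICT = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]')
print(STRICT.sub(r'\1', line))

# Lookaround form of the question's idea: a '[' followed by two digits,
# or a ']' preceded by a dot and three digits. Only the brackets themselves
# are matched, so replacing with '' removes them and nothing else.
LOOKAROUND = re.compile(r'\[(?=\d\d)|(?<=\.\d{3})\]')
print(LOOKAROUND.sub('', line))
```

Both calls leave the [0x00001388] and [SHDNT] fields untouched on this sample, because neither starts with two digits nor ends with a dot and three digits.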

Upvotes: 1
