Federer

Reputation: 34745

Regex for Two Digits and a Hyphen

I've got a very long text file as a result of a test last night. Stupidly I forgot to format it correctly with "\n". A sample is:

"01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.57801-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578"... on and on.

As you can see, there is nothing separating the end of the epoch timestamp from the next text file name. Fortunately each text file name starts with two digits and a hyphen. So the above sample should look like this:

01-someText151645.txt,Wed Feb  1 16:15:18 2012,1328112918.578
01-HalfMeg151646.txt,Wed Feb  1 16:15:18 2012,1328112918.578

Unfortunately a previous project where I did lots of regex parsing isn't to hand, so I need a little help getting a regex for this. My plan is then to use re.findall(regex, sample) to get the information I want.

Edit: To be explicit, each line has a text file name, a date and an epoch timestamp, all separated by "," (no spaces), i.e. textfile,date,epoch. Each text file name begins with two digits and a hyphen (digit, digit, "-").

Upvotes: 5

Views: 14108

Answers (3)

jsvk

Reputation: 1729

Here's what I've thrown together; adapt it to suit:

import re

m = """01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.57801-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578"""

print(m)

# Prepend a newline to every "two digits and a hyphen" match
def addNewLineBefore(matchObject):
    return "\n" + matchObject.group(0)

print(re.sub(r'\d{2}-', addNewLineBefore, m))

It assumes that the \d{2}- match occurs only at the beginning of a line. If there's a possibility it appears within the line, such as in the filename, I can edit this answer to accommodate that.
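For the case where two digits and a hyphen can also turn up inside a filename, one option is to match the whole record rather than just its start, which also fits the re.findall plan from the question. This is only a sketch: the pattern and the names record and rows are my own, and it assumes the epoch always has exactly three decimal places and that no field contains a comma.

```python
import re

sample = ("01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.578"
          "01-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578")

# Assumed record layout: filename,date,epoch with exactly three decimals.
# [^,]+ lets the filename contain further digits and hyphens.
record = re.compile(r"(\d{2}-[^,]+),([^,]+),(\d+\.\d{3})")

# With three groups, findall returns one (filename, date, epoch) tuple per record
rows = record.findall(sample)

for filename, date, epoch in rows:
    print(",".join((filename, date, epoch)))
```

Because the three-digit fraction marks where a record ends, a stray "12-" inside a filename no longer starts a new line.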

EDIT: In the event you don't want to read your entire file into memory, you can use a buffer:

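import re

# One record: two digits and a hyphen, then the shortest run up to ".ddd"
oneLine = re.compile(r"""(
        \d{2}-  # the beginning of the line
        .+?     # the middle of the line
        \.\d{3} # the dot and three digits at the end
)""", re.X)
# equivalent without the comments: re.compile(r'(\d{2}-.+?\.\d{3})')

with open("infile", "r") as infile, open("outfile", "w") as outfile:
    pending = ""  # unfinished record carried over from the previous chunk
    while True:
        chunk = infile.read(6000)  # adjust this to suit
        if not chunk:
            break
        # split() returns [text, record, text, record, ..., unfinished tail]
        pieces = oneLine.split(pending + chunk)
        pending = pieces.pop()
        outfile.writelines(piece + "\n" for piece in pieces if piece)
    if pending:
        outfile.write(pending + "\n")  # leftover text that never completed a record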

This shouldn't be used if error checking and efficiency are necessities. From what I understand, this is a very controlled and informal environment.

Upvotes: 6

Borodin

Reputation: 126722

If your file is small enough to be read into memory all at once, you could simply split it on a lookahead regex (Python 3.7 or later, where re.split accepts patterns that can match an empty string):

re.split(r'(?=\d\d-)', contents)

or insert newlines where they belong:

re.sub(r'(?=\d\d-)', "\n", contents)
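A small sketch of both calls on the sample from the question may help. The variable names are mine, and the leading empty piece and leading newline are dropped by hand, because the lookahead also matches at the very start of the string.

```python
import re

contents = ("01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.578"
            "01-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578")

# Zero-width lookahead: matches the position just before "two digits and a hyphen".
# Splitting on an empty match like this needs Python 3.7+.
lines = [part for part in re.split(r"(?=\d\d-)", contents) if part]

# Same idea with sub: put a newline at each such position, then drop the
# newline that lands before the very first record.
text = re.sub(r"(?=\d\d-)", "\n", contents).lstrip("\n")

print(text)
```

As with the plain \d{2}- approach, this relies on two digits and a hyphen never occurring anywhere except at the start of a record.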

Upvotes: 1

Barry Chapman

Reputation: 6780

Here, try this:

([0-9]{2}-[a-zA-Z]{5,}[0-9]{5,}\.txt){1,}

That loosely matches the format of your filenames: two digits and a hyphen, at least five letters, at least five digits, then ".txt". You can adjust it to your needs.

Split on this pattern, then rebuild the records from the pieces.
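One way the split-and-rebuild step might look is sketched below, using the sample from the question. The pairing logic and the names parts and records are my own illustration. It relies on re.split returning the captured filenames alternating with the text that follows each one.

```python
import re

sample = ("01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.578"
          "01-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578")

filename = re.compile(r"([0-9]{2}-[a-zA-Z]{5,}[0-9]{5,}\.txt){1,}")

# Because the pattern has a capturing group, split() keeps the filenames:
# [text before first filename, filename, rest, filename, rest, ...]
parts = filename.split(sample)

# Glue each filename back onto the ",date,epoch" text that followed it
records = [name + rest for name, rest in zip(parts[1::2], parts[2::2])]

print("\n".join(records))
```

Filenames that do not fit the letters-then-digits shape would be missed, so the character classes may need loosening for real data.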

Upvotes: 1
