Reputation: 34745
I've got a very long text file as a result of a test last night. Stupidly I forgot to format it correctly with "\n". A sample is:
"01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.57801-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578"... on and on.
As you can see, there is no separator between the end of the epoch timestamp and the next text file name. Fortunately, each text file name starts with two digits and a hyphen. So the above sample should look like this:
01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.578
01-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578
Unfortunately, a previous project where I did a lot of regex parsing isn't to hand, so I need a little help getting a regex for this. My plan is to then use re.findall(regex, sample) to get the information I want.
Edit: To be explicit, each line has a text file name, a date and an epoch timestamp, all separated by "," (no spaces). Each file name begins with 2 digits and a hyphen. So that is: textfile,date,epoch, where textfile starts with digit, digit, -
Upvotes: 5
Views: 14108
Reputation: 1729
Here's what I've thrown together; adapt it to suit:
import re
m = """01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.57801-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578"""
print(m)
addNewLineBefore = lambda matchObject: "\n" + matchObject.group(0)
print ( re.sub(r'\d{2}-',addNewLineBefore,m) )
It assumes that a \d{2}- match occurs only at the beginning of a line. If there's a possibility that it appears within a line, such as in the filename, I can edit this answer to accommodate that.
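If a digit-digit-hyphen sequence can also turn up inside a filename, one way to accommodate that is to require that the break comes directly after the three decimals of the epoch timestamp. This is only a sketch, assuming every epoch value ends in a dot and exactly three digits as in the sample; the filenames in m2 and the name recordBoundary are made up for illustration:

```python
import re

# Hypothetical sample: the first filename contains "22-", which the plain
# \d{2}- pattern would wrongly treat as the start of a new record.
m2 = ("01-run22-a151645.txt,Wed Feb 1 16:15:18 2012,1328112918.578"
      "02-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578")

# Zero-width match: preceded by ".ddd" (the end of an epoch timestamp)
# and followed by "dd-" (the start of the next filename).
recordBoundary = re.compile(r"(?<=\.\d{3})(?=\d{2}-)")

fixed = recordBoundary.sub("\n", m2)
print(fixed)
```

Because the lookbehind cannot match at the very start of the string, this variant also avoids putting an empty line before the first record.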
EDIT: In the event you don't want to read your entire file into memory, you can use a buffer:
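import re

# One record: two digits and a hyphen, a lazily matched middle, then ".ddd".
oneLine = re.compile(r"""(
    \d{2}-    # the beginning of the line
    .+?       # the middle of the line
    \.\d{3}   # the dot and three digits at the end
)""", re.X)

with open("infile", "r") as infile, open("outfile", "w") as outfile:
    leftover = ""  # trailing piece of the previous chunk, possibly cut mid-record
    while True:
        buffer = infile.read(6000)  # adjust this to suit
        if not buffer:
            break
        # split() on a capturing group returns the records plus the (mostly
        # empty) text between them; drop the empty strings.
        pieces = [p for p in oneLine.split(leftover + buffer) if p]
        # Hold the last piece back: the chunk may have ended mid-record.
        leftover = pieces.pop()
        outfile.write("".join(p + "\n" for p in pieces))
    if leftover:
        outfile.write(leftover + "\n")  # flush the final record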
This shouldn't be used if error checking and efficiency are necessities. From what I understand, this is a very controlled and informal environment.
Upvotes: 6
Reputation: 126722
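If your file is small enough to be read into memory all at once, then you could simply split it on a lookahead regex
re.split(r'(?=\d\d-)', contents)
or, to insert newlines where they belong,
re.sub(r'(?=\d\d-)', "\n", contents)
Note that re.split only splits on an empty match such as this lookahead in Python 3.7 and later. Both calls also match at position 0, so the result starts with an empty string or a leading newline respectively.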
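A short sketch of how the split result could be turned into fields, assuming Python 3.7 or later and that neither the filename nor the date contains a comma; the variable names are illustrative only:

```python
import re

contents = ("01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.578"
            "01-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578")

# The lookahead also matches at position 0, so drop the empty first element.
records = [r for r in re.split(r"(?=\d\d-)", contents) if r]

# Each record is textfile,date,epoch with no spaces around the commas.
rows = [tuple(r.split(",")) for r in records]
for name, date, epoch in rows:
    print(name, date, float(epoch))
```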
Upvotes: 1
Reputation: 6780
Here, try this:
([0-9]{2}-[a-zA-Z]{5,}[0-9]{5,}\.txt){1,}
That would match (closely but loosely) the format of your filename. You can adjust to your needs.
Do a split on this, and then separate the file accordingly.
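One way to read "split on this" is to use the start of each filename match as a cut point. A rough sketch under that reading; split_records is a helper made up for this example, and the pattern is the one above without the outer group and the {1,}:

```python
import re

# Same filename shape as above: two digits, a hyphen, at least five letters,
# at least five digits, then ".txt".
filename = re.compile(r"[0-9]{2}-[a-zA-Z]{5,}[0-9]{5,}\.txt")

def split_records(text):
    """Cut text at the start of every filename match (hypothetical helper)."""
    starts = [m.start() for m in filename.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

sample = ("01-someText151645.txt,Wed Feb 1 16:15:18 2012,1328112918.578"
          "01-HalfMeg151646.txt,Wed Feb 1 16:15:18 2012,1328112918.578")

for record in split_records(sample):
    print(record)
```

Any text before the first filename match is dropped by this helper, which is fine for the sample but worth checking against the real file.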
Upvotes: 1