Reputation: 892
I have a huge text file and need to split it into several files. The text file contains an identifier that marks where each split should happen. Here is what part of the text file looks like:
Comp MOFVersion 10.1
Copyright 1997-2006. All rights reserved.
--------------------------------------------------
Mon 11/19/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
...
exit
---------------------
list volume
list partition
exit
---------------------
Volume 0 is the selected volume.
Disk ### Status Size Free Dyn Gpt
-------- ------------- ------- ------- --- ---
* Disk 0 Online 238 GB 136 GB *
--------------------------------------------------
Tue 11/20/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
....
SERVICE_NAME: vds
TYPE : 10 WIN32_OWN_PROCESS
STATE : 1 STOPPED
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
---------------------
*exit /b 0
File not found - *.*
0 File(s) copied
--------------------------------------------------
Wed 11/21/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
---------------------
*exit /b 0
11/19/2021 08:34 AM <DIR> .
11/19/2021 08:34 AM <DIR> ..
11/19/2021 08:34 AM 0 SL
1 File(s) 0 bytes
2 Dir(s) 80,160,923,648 bytes free
I expect to split the file at each occurrence of the string "Starting The Process". So for a text file like the example above, the file would be split into 3 files, each with different content. For example:
file1
--------------------------------------------------
Mon 11/19/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
...
exit
---------------------
list volume
list partition
exit
---------------------
Volume 0 is the selected volume.
Disk ### Status Size Free Dyn Gpt
-------- ------------- ------- ------- --- ---
* Disk 0 Online 238 GB 136 GB *
file2
--------------------------------------------------
Tue 11/20/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
....
SERVICE_NAME: vds
TYPE : 10 WIN32_OWN_PROCESS
STATE : 1 STOPPED
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
---------------------
*exit /b 0
File not found - *.*
0 File(s) copied
file3
--------------------------------------------------
Wed 11/21/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
---------------------
*exit /b 0
11/19/2021 08:34 AM <DIR> .
11/19/2021 08:34 AM <DIR> ..
11/19/2021 08:34 AM 0 SL
1 File(s) 0 bytes
2 Dir(s) 80,160,923,648 bytes free
Here is what I've tried:
logfile = "E:/DATA/result.txt"
with open(logfile, 'r') as text_file:
    lines = text_file.readlines()
    for line in lines:
        if "Starting The Process..." in line:
            print(line)
I am only able to find the lines that contain the string, but I don't know how to collect the content that follows each match, split it into 3 parts, and write each part to a new file.
Is it possible to do it in Python? Thank you for any advice.
Upvotes: 0
Views: 402
Reputation: 899
An alternative solution, if the dash lines in the file have a fixed length, could be:
with open('file.txt', 'r') as f:
    split_text = f.read().split('--------------------------------------------------')
    split_text.pop(0)  # To remove the Copyright message at the start
    for i in range(0, len(split_text) - 1, 2):
        with open(f'file{i // 2}.txt', 'w') as temp:
            temp_txt = ''.join(split_text[i:i+2])
            temp.write(temp_txt)
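Essentially, I am just splitting on those 50-character dash lines and joining every two consecutive elements (the timestamp line and the content that follows it). This way each file keeps its timestamp together with its content. Note that the dash lines themselves are consumed by split, so they do not appear in the output files.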
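If the dash lines should be kept in the output, as in the expected files from the question, the same splitting idea can be sketched as a small function. This is only an illustration under the same assumption (the separator is exactly 50 dashes and never appears inside the content); `split_on_separator` and `SEPARATOR` are hypothetical names, not part of the original answer.

```python
SEPARATOR = "-" * 50  # assumed fixed-length dash line, as in the sample


def split_on_separator(text, separator=SEPARATOR):
    """Return one chunk per block, with both dash lines put back in."""
    pieces = text.split(separator)
    pieces = pieces[1:]  # drop the header before the first separator
    chunks = []
    for i in range(0, len(pieces) - 1, 2):
        # pieces[i] holds the timestamp line, pieces[i + 1] the content after it
        chunks.append(separator + pieces[i] + separator + pieces[i + 1])
    return chunks
```

Each returned chunk could then be written to its own file, for example with `enumerate(chunks, start=1)` to number the files from 1.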
Upvotes: 0
Reputation: 521457
Well, if the file is small enough to comfortably fit into memory (say 1GB or less), you could read the entire file into a string and then use re.findall:
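import re

with open('data.txt', 'r') as file:
    data = file.read()

# A block starts at a dash line directly followed by a "Mon 11/19/2022"-style timestamp line.
header = r'-{10,}\n\w{3} \d{2}/\d{2}/\d{4}'
# Match lazily up to the next block header (or the end of the input), so that
# shorter dash runs inside the content do not cut a part short.
parts = re.findall(header + r'.*?(?=' + header + r'|\Z)', data, flags=re.S)

for cnt, part in enumerate(parts, start=1):
    with open('file ' + str(cnt), 'w') as output:
        output.write(part)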
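If the file is too large to read into memory, a line-by-line variant is one possible alternative. The sketch below is not from the answer above; it assumes every "Starting The Process" line is directly preceded by a dash line that belongs to the same block, and the function name `split_log_streaming`, its parameters, and the output naming scheme are all hypothetical.

```python
import os


def split_log_streaming(path, marker="Starting The Process",
                        out_dir=".", prefix="file"):
    """Write one output file per marker block, reading one line at a time.

    Returns the list of output paths that were written.
    """
    written = []
    out = None
    held = None  # previous line, held back until we know which block it belongs to
    try:
        with open(path, "r") as src:
            for line in src:
                if marker in line:
                    # A new block starts: the held dash line goes into the new file.
                    if out is not None:
                        out.close()
                    name = os.path.join(out_dir, f"{prefix}{len(written) + 1}.txt")
                    out = open(name, "w")
                    written.append(name)
                if held is not None and out is not None:
                    out.write(held)
                held = line
            # Flush the last held line into the last block.
            if held is not None and out is not None:
                out.write(held)
    finally:
        if out is not None:
            out.close()
    return written
```

Anything before the first block (such as the copyright header) is skipped, because no output file is open yet at that point.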
Upvotes: 1