Cheries

Reputation: 892

How to split a file using a string as an identifier in Python?

I have a huge text file that I need to split into several files. The text file contains an identifier that marks where each split should occur. Here is what part of the text file looks like:

Comp MOFVersion 10.1
Copyright 1997-2006. All rights reserved.
-------------------------------------------------- 
Mon 11/19/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
...

exit 
--------------------- 
list volume 
list partition 
exit
--------------------- 

Volume 0 is the selected volume.

Disk ###  Status         Size     Free     Dyn  Gpt
--------  -------------  -------  -------  ---  ---
* Disk 0    Online          238 GB   136 GB        *

-------------------------------------------------- 
Tue 11/20/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
....
SERVICE_NAME: vds 
    TYPE               : 10  WIN32_OWN_PROCESS  
    STATE              : 1  STOPPED 
    WIN32_EXIT_CODE    : 0  (0x0)
    SERVICE_EXIT_CODE  : 0  (0x0)
    CHECKPOINT         : 0x0
    WAIT_HINT          : 0x0
--------------------- 
*exit /b 0 
File not found - *.*
0 File(s) copied

-------------------------------------------------- 
Wed 11/21/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here

==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
--------------------- 
*exit /b 0 

11/19/2021  08:34 AM    <DIR>          .
11/19/2021  08:34 AM    <DIR>          ..
11/19/2021  08:34 AM                 0 SL
               1 File(s)              0 bytes
               2 Dir(s)  80,160,923,648 bytes free

I want to split the file wherever the string "Starting The Process" appears. For the example above, the file would be split into 3 files, each with different content. For example:

file1
-------------------------------------------------- 
Mon 11/19/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
...

exit 
--------------------- 
list volume 
list partition 
exit
--------------------- 

Volume 0 is the selected volume.

Disk ###  Status         Size     Free     Dyn  Gpt
--------  -------------  -------  -------  ---  ---
* Disk 0    Online          238 GB   136 GB        *


file2
-------------------------------------------------- 
Tue 11/20/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
....
SERVICE_NAME: vds 
    TYPE               : 10  WIN32_OWN_PROCESS  
    STATE              : 1  STOPPED 
    WIN32_EXIT_CODE    : 0  (0x0)
    SERVICE_EXIT_CODE  : 0  (0x0)
    CHECKPOINT         : 0x0
    WAIT_HINT          : 0x0
--------------------- 
*exit /b 0 
File not found - *.*
0 File(s) copied

file3
-------------------------------------------------- 
Wed 11/21/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here

==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
--------------------- 
*exit /b 0 

11/19/2021  08:34 AM    <DIR>          .
11/19/2021  08:34 AM    <DIR>          ..
11/19/2021  08:34 AM                 0 SL
               1 File(s)              0 bytes
               2 Dir(s)  80,160,923,648 bytes free

Here is what I've tried:

logfile = "E:/DATA/result.txt"
with open(logfile, 'r') as text_file:
    lines = text_file.readlines()
    for line in lines:
        if "Starting The Process..." in line:
            print(line)

I can find the lines that contain the string, but I don't know how to collect the content that follows each one, split it into three parts, and write each part to a new file.

Is it possible to do it in Python? Thank you for any advice.

Upvotes: 0

Views: 402

Answers (2)

ChaoS Adm

Reputation: 899

If the dashed separator lines in the file have a fixed length, an alternative solution could be:

with open('file.txt', 'r') as f:
    # Split on the 50-dash separator line that surrounds each timestamp
    split_text = f.read().split('--------------------------------------------------')
split_text.pop(0)  # To remove the Copyright message at the start

# Elements now alternate: timestamp, content, timestamp, content, ...
for i in range(0, len(split_text) - 1, 2):
    with open(f'file{i // 2}.txt', 'w') as temp:
        temp_txt = ''.join(split_text[i:i+2])
        temp.write(temp_txt)

Essentially, I am just splitting on the basis of those dashes and joining every consecutive element. This way you keep the info about the timestamp with the content in each file.
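
As a rough illustration of the split-and-pair idea on an in-memory string, here is a minimal sketch. The helper name pair_sections and the SEPARATOR constant are made up for this example, and it assumes every header separator is exactly 50 dashes. Note that str.split consumes the separator, so the 50-dash lines themselves do not appear in the output, while shorter dashed lines inside a section are left alone.

```python
SEPARATOR = "-" * 50  # assumption: header separators are exactly 50 dashes


def pair_sections(text, separator=SEPARATOR):
    """Return one string per section: timestamp line followed by its body."""
    pieces = text.split(separator)
    pieces = pieces[1:]  # drop whatever precedes the first separator
    # pieces now alternate: timestamp, body, timestamp, body, ...
    return ["".join(pieces[i:i + 2]) for i in range(0, len(pieces) - 1, 2)]
```

Each returned string can then be written to its own file, as in the loop above.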

Upvotes: 0

Tim Biegeleisen

Reputation: 521457

If the file is small enough to fit comfortably in memory (say, 1 GB or less), you could read the entire file into a string and then use re.findall:

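import re

# A header is a dashed line followed by a line starting with a date such as "Mon 11/19/2022"
header = r'-{10,}[^-]*\n\w{3} \d{2}/\d{2}/\d{4}'

with open('data.txt', 'r') as file:
    data = file.read()

# Each part runs from one header up to the next header (or the end of the data),
# so dashed lines inside a section do not cut it short
parts = re.findall(header + r'.*?-{10,}.*?(?=' + header + r'|$)', data, flags=re.S)

for cnt, part in enumerate(parts, start=1):
    with open('file ' + str(cnt), 'w') as output:
        output.write(part)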
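
If the file is too large to read in one go, a line-by-line variant may be worth considering. The following is only a sketch under a few assumptions: the function name split_log, the helper is_separator, and the output naming (file1.txt, file2.txt, ...) are invented for this example, and it assumes each "Starting The Process" line is directly preceded by a line made up only of dashes. It holds back one line at a time so that the dashed line above a marker ends up in the new file rather than the previous one; anything before the first marker is skipped.

```python
from pathlib import Path

MARKER = "Starting The Process"


def is_separator(line):
    """True if the line consists only of dashes (ignoring whitespace)."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) == {"-"}


def split_log(src, out_dir, marker=MARKER):
    """Stream src and start a new output file at every line containing marker."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    held = None  # previous line, held until we know which file it belongs to
    try:
        with open(src, "r") as text_file:
            for line in text_file:
                if marker in line:
                    # A dashed line directly above the marker opens the new section
                    header = held if held is not None and is_separator(held) else None
                    if header is None and held is not None and out is not None:
                        out.write(held)
                    if out is not None:
                        out.close()
                    target = out_dir / f"file{len(written) + 1}.txt"
                    out = open(target, "w")
                    written.append(target)
                    if header is not None:
                        out.write(header)
                elif held is not None and out is not None:
                    out.write(held)
                held = line
            # Flush the last held line into the final section
            if held is not None and out is not None:
                out.write(held)
    finally:
        if out is not None:
            out.close()
    return written
```

With the path from the question, a call might look like split_log("E:/DATA/result.txt", "E:/DATA/parts"), where the output directory name is a placeholder. Only one line is kept in memory at a time, so file size should not matter.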

Upvotes: 1
