Sasha Hoffman

Reputation: 23

Split txt file into multiple new files with regex

I am calling on the collective wisdom of Stack Overflow because I am at my wits' end trying to figure out how to do this, and I'm a self-taught newbie coder.

I have a txt file of Letters to the Editor that I need to split into their own individual files.

The letters are all formatted in roughly the same way:

For once, before offering such generous but the unasked for advice, put yourselves in...

Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...

Why is it that The Times does not urge totalitarian Arab slates and terrorist...

PAUL STONEHILL Los Angeles

There you go again. Your editorial again makes groundless criticisms of the Israeli...

On Dec. 7 you called proportional representation “bizarre," despite its use in the...

Proportional representation distorts Israeli politics? Huh? If Israel changes the...

MATTHEW SHUGART Laguna Beach

Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...

Although the mayor did not support Proposition U (the slow-growth initiative) his...

If West Los Angeles is any indication of the no-growth policy, where do we go from here?

MARJORIE L. SCHWARTZ Los Angeles

I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.

I have tried quite a few different approaches, but nothing seems to work quite right. All the other answers I have seen rely on a repeated line or word (for example, the answers to "how to split single txt file into multiple txt files by Python" and "Python read through file until match, read until next pattern"). None of them work once I adapt them to my regex for all-capital words.

The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.

import re
thefile = raw_input('Filename to split: ')
name_occur = [] 
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open (thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line) 

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)","",thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()

I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.

Upvotes: 2

Views: 4399

Answers (3)

gregory

Reputation: 13023

While the other answer is suitable, you may still be curious about using a regex to split up a file.

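import re

# A signature line starts with a run of capitals, spaces and periods (the name).
# Note: this also matches a body line that starts with a one-letter capital
# word such as "I " or "A ".
signature = re.compile(r'^([A-Z\s\.]+\b)')

buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += line
        match = signature.search(line)
        if match is not None:
            # The signature closes the current letter: write everything
            # buffered so far to a file named after the writer.
            smallfile_name = '{}.txt'.format(match.group(1).strip())
            with open(smallfile_name, 'w') as smallfile:
                smallfile.write(buf)
            buf = ""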
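
If the whole file fits in memory, another way to apply the same idea is to let the regex find every signature line in one pass and slice the text between them. This is only a sketch under assumptions: `split_on_signatures` and `SIGNATURE` are names made up for illustration, and the pattern assumes every signature starts with at least two all-caps words (an initial such as `L.` counts as a word after the first one). That holds for the sample in the question but may not hold for the full file.

```python
import re

# Assumed signature shape: a line that starts with an all-caps word of two or
# more letters, followed by at least one more all-caps word or initial.
# Anything after that (the city) is ignored. Adjust to fit the real data.
SIGNATURE = re.compile(r'^([A-Z]{2,}(?: [A-Z][A-Z.]*)+)\b.*$', re.MULTILINE)


def split_on_signatures(text):
    """Return a list of (name, letter_text) pairs, one per signature line."""
    letters = []
    start = 0
    for match in SIGNATURE.finditer(text):
        # Each letter runs from the end of the previous signature line
        # through the end of this one.
        letters.append((match.group(1), text[start:match.end()].strip()))
        start = match.end()
    return letters
```

Each pair could then be written to its own file. Numbering the output files rather than naming them after the writer would avoid one letter overwriting another if the same person appears twice.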

Upvotes: 1

mVChr

Reputation: 50205

I took a simpler approach and avoided regex. The tactic is to count the uppercase letters in each of the first three words of a line and check them against a simple rule: the first word must contain more than one uppercase letter, and so must either the second or the third word. You can adjust this if needed. Each letter is then written to a new file named after the original file, with an incrementing integer appended before the extension (note: it assumes the filename has an extension such as .txt). Try it out and see how it works for you.

import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)

    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)

        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1

        else:
            current_letter.append(line)

I tested it on your sample input and it worked fine.
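
As for why the original loop goes wrong after the second file: it calls `full_file.remove(line)` while a `for` loop is iterating over `full_file`. Removing an item shifts the remaining ones left, so the loop skips whichever line slides into the freed slot. `remove` also deletes the first item equal to its argument, so with many identical blank lines it may delete a different line from the one just written. A small illustration of the skipping, using a made-up list in place of the file's lines:

```python
lines = ['a', 'b', 'c', 'd', 'e', 'f']
seen = []
for line in lines:
    seen.append(line)
    lines.remove(line)  # mutating the list that is being iterated

print(seen)   # ['a', 'c', 'e']
print(lines)  # ['b', 'd', 'f']
```

Half the items are never visited. Iterating once without removing anything, and writing lines out as you go, avoids both problems.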

Upvotes: 1
