Mia

Reputation: 1

How to parse a complex text file using Python string methods or regex and export into tabular form

As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table. (The columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.)

I think regex is what I need, but my class did not go over it, so I am confused about how to parse the file in order to extract the correct data and output it as an organized table.

I am supposed to take my text file, which looks like this:

https://pastebin.com/ZM8EPu0p

and export it into a more readable format like this (example output is below).

Here is what I have so far.

def readFile(court):
    csv_rows = []
    entry = {}  # fields collected for the entry currently being read
    # read the txt file and split it into chunks of data by paragraph
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")

    for chunk in data_chunks:
        chunk = chunk.strip()  # .strip() removes leading/trailing whitespace
        if chunk[:4].isnumeric():  # if the first 4 characters are digits
            entry = {}  # initialize an empty dictionary
        elif not chunk and entry:
            # if we're on an empty chunk and the entry dict is not empty
            csv_rows.append(entry)  # keep the finished entry for the output
            entry = {}
        else:

            # parse here?

            print(chunk)

    return csv_rows

readFile("/Users/mia/Desktop/School/programming/court.txt")

Upvotes: 0

Views: 309

Answers (1)

Florin C.

Reputation: 613

That is quite a lot of work, but it is possible if you split it into a couple of sub-tasks. First, your input looks like a plain text file, so you can parse it line by line, for example with readlines(): https://www.w3schools.com/python/ref_file_readlines.asp
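To make the first sub-task concrete, here is a minimal sketch of reading the input line by line. The helper name read_lines and the sample text are made up for illustration; io.StringIO stands in for the real file so the snippet runs anywhere.

```python
import io


def read_lines(handle):
    """Return every line of an open text file, without the trailing newline."""
    return [line.rstrip("\n") for line in handle.readlines()]


# io.StringIO is a stand-in for open("court.txt"); the text is invented.
sample = io.StringIO("first line\nsecond line\n\nfourth line\n")
lines = read_lines(sample)
print(lines)
```

With a real file you would pass the object returned by open(path) inside a with block instead of the StringIO stand-in.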

Then, I noticed that your data can be split into pages. You will need to prepare quite a few regular expressions, but you can start with one that identifies where each page starts. You may want to read this first, as your expressions might get quite complicated: https://www.w3schools.com/python/python_regex.asp The goal of this step is to collect all the lines of a page in some container (a list, a dict, whatever you find suitable).
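A rough sketch of the page-splitting step might look like the following. The page marker ("PAGE:" followed by a number) and the sample lines are assumptions, since the real layout is only in the pastebin; adjust the pattern to whatever actually starts a page in your file.

```python
import re

# Hypothetical page marker: any line containing "PAGE:" followed by a number.
PAGE_START = re.compile(r"PAGE:\s*\d+")


def split_into_pages(lines):
    """Group lines into pages; a new page begins at each marker line."""
    pages = []
    for line in lines:
        # Open a new page at a marker, or at the very first line if the
        # input does not begin with a marker.
        if PAGE_START.search(line) or not pages:
            pages.append([])
        pages[-1].append(line)
    return pages


# Invented sample lines, only to show the grouping.
sample_lines = [
    "COURT CALENDAR   PAGE: 1",
    "   1  21CR000001  DOE,JOHN A",
    "COURT CALENDAR   PAGE: 2",
    "   2  21CR000002  ROE,JANE",
]
pages = split_into_pages(sample_lines)
print(len(pages))
```

A list of lists keeps the pages in order; a dict keyed by page number would work just as well if you need to look pages up later.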

Afterwards, write some code that parses the information page by page. For simplicity, I suggest starting with something easy, like the columns for "no", "file number" and "defendant".
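As an illustration of that first easy step, the sketch below pulls "no", "file number" and "defendant" out of each matching line. The row layout (entry number, a file number such as 21CR000001, then a name that ends at the first run of two or more spaces) is an assumption and will need to be adapted to the real file.

```python
import re

# Hypothetical row layout: entry number, file number, defendant name.
# The name is taken as words separated by single spaces, so it stops at
# the first gap of two or more spaces.
ROW = re.compile(
    r"^\s*(?P<no>\d+)\s+"
    r"(?P<file_number>\d{2}[A-Z]+\d+)\s+"
    r"(?P<defendant>\S+(?: \S+)*)"
)


def parse_rows(lines):
    """Return one dict per line that matches the row pattern; skip the rest."""
    rows = []
    for line in lines:
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


# Invented sample page, only to exercise the pattern.
page = [
    "COURT CALENDAR   PAGE: 1",
    "   1  21CR000001  DOE,JOHN A        ATTY: SMITH",
    "      (T)SPEEDING",
    "   2  21CR000002  ROE,JANE",
]
rows = parse_rows(page)
print(rows)
```

Named groups keep the pattern readable and give you dictionaries whose keys can later serve as column names.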

Once you can extract some data reliably, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
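For the export step, here is a small sketch that uses the standard library's csv.DictWriter instead of pandas, so it runs without extra packages; this is a swapped-in alternative, not the pandas route itself. The helper name write_rows, the column names and the sample row are invented for illustration.

```python
import csv
import io


def write_rows(rows, handle, fieldnames):
    """Write a list of dicts as CSV, header first, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, dialect="excel")
    writer.writeheader()
    writer.writerows(rows)


# Invented sample data; real rows would come from the parsing step.
rows = [
    {"no": "1", "file_number": "21CR000001", "defendant": "DOE,JOHN A"},
]
buffer = io.StringIO()  # stand-in for open("court.csv", "w", newline="")
write_rows(rows, buffer, ["no", "file_number", "defendant"])
print(buffer.getvalue())
```

With pandas installed, the equivalent would be roughly pd.DataFrame(rows).to_excel("court.xlsx", index=False), which also needs an Excel engine such as openpyxl.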

Upvotes: 1
