watisit

Reputation: 303

Splitting the elements of a list into a list and then splitting them again

This is a sample of the raw text I'm reading:

ID: 00000001
SENT: to do something
to    01573831
do    02017283
something    03517283

ID: 00000002
SENT: just an example
just    06482823
an    01298744
example    01724894

Right now I'm trying to split it into a list of lists of lists.

Topmost level: split by ID, so there are 2 elements here (done).

Next level: within each ID, split by newlines.

Last level: within each line, split the word from its ID. For the lines beginning with ID or SENT, it doesn't matter whether they are split or not. The word and its ID are separated by a tab (\t).

Current code:

f=open("text.txt","r")
raw=list(f)
text=" ".join(raw)
wordlist=text.split("\n \n ") #split by ID
toplist=wordlist[:2] #just take 2 IDs

Edit: I was going to cross-reference the words against another text file to add their word classes, which is why I asked for a list of lists of lists.

Steps:

1) Use .append() to add on word classes for each word

2) Use "\t".join() to connect a line together

3) Use "\n".join() to connect different lines in an ID

4) Use "\n\n".join() to connect all the IDs together into a string

Output:

ID: 00000001
SENT: to do something
to    01573831    prep
do    02017283    verb
something    03517283    noun

ID: 00000002
SENT: just an example
just    06482823    adverb
an    01298744    ind-art
example    01724894    noun
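The four steps above could be sketched roughly as follows. This is only an illustration in Python 3: the `annotate` function and the `WORD_CLASSES` dictionary are hypothetical names (in practice the word classes would be loaded from the second text file), and unknown words are tagged `"unknown"` purely as a placeholder.

```python
# Hypothetical lookup table; the real one would be built from the second file.
WORD_CLASSES = {"to": "prep", "do": "verb", "something": "noun"}


def annotate(text, word_classes):
    """Append a word class to each word line, then rebuild the text."""
    blocks = []
    for block in text.strip().split("\n\n"):          # one block per ID
        rows = [line.split("\t") for line in block.split("\n")]
        for row in rows[2:]:                          # rows 0 and 1 are ID and SENT
            row.append(word_classes.get(row[0], "unknown"))   # step 1
        # steps 2 and 3: join fields with tabs, then lines with newlines
        blocks.append("\n".join("\t".join(row) for row in rows))
    return "\n\n".join(blocks)                        # step 4


sample = (
    "ID: 00000001\n"
    "SENT: to do something\n"
    "to\t01573831\n"
    "do\t02017283\n"
    "something\t03517283"
)
result = annotate(sample, WORD_CLASSES)
print(result)
```

The ID and SENT lines contain no tab, so splitting and re-joining them leaves them unchanged.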

Upvotes: 0

Views: 379

Answers (4)

Nick Burns

Reputation: 983

Would this work for you?

Top-level (which you have already done):

def get_parent(text, parent):
    """recursively walk through text, looking for 'ID' tag"""

    # find open_ID and close_ID
    open_ID = text.find('ID')
    close_ID = text.find('ID', open_ID + 1)

    # if there is another instance of 'ID', recursively walk again
    if close_ID != -1:
        parent.append(text[open_ID : close_ID])
        return get_parent(text[close_ID:], parent)
    # base-case 
    else:
        parent.append(text[open_ID:])
        return

Second-level: split by newlines:

def child_split(parent):
    index = 0
    while index < len(parent):
        parent[index] = parent[index].split('\n')
        index += 1

Third-level: split the 'ID' and 'SENT' fields:

def split_field(parent, index):
    if index < len(parent):
        child = 0
        while child < len(parent[index]):
            if ':' in parent[index][child]:
                parent[index][child] = parent[index][child].split(':')
            else:
                parent[index][child] = parent[index][child].split()
            child += 1
        return split_field(parent, index + 1)
    else:
        return

Running it all together:

def main(text):
    parent = []
    get_parent(text, parent)
    child_split(parent)
    split_field(parent, 0)
    return parent  # hand the nested list back to the caller

The result is quite nested; perhaps it can be cleaned up somewhat? Or perhaps the split_field() function could return a dictionary?
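As a rough sketch of the dictionary idea (written for Python 3; `block_to_dict` and the `"words"` key are hypothetical names, not part of the code above), each ID block could be turned into one dictionary:

```python
def block_to_dict(block):
    """Turn one ID block into a dict with ID, SENT and a list of word rows."""
    record = {"words": []}
    for line in block.split("\n"):
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith(("ID:", "SENT:")):
            # split only at the first colon, so the value may contain colons
            key, _, value = line.partition(":")
            record[key] = value.strip()
        else:
            record["words"].append(line.split())   # [word, word_id]
    return record


block = "ID: 00000001\nSENT: to do something\nto\t01573831\ndo\t02017283"
record = block_to_dict(block)
print(record["ID"], record["SENT"], record["words"])
```

This keeps the word rows as lists, so a word class can still be appended to each one later.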

Upvotes: 0

jamylak

Reputation: 133764

I'm not sure exactly what output you need, but you can adjust this to fit your needs (this uses the itertools grouper recipe):

>>> from itertools import izip_longest
>>> def grouper(n, iterable, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return izip_longest(fillvalue=fillvalue, *args)

>>> with open('text.txt') as f:
        print [[x.rstrip().split(None, 1) for x in g if x.rstrip()]
               for g in grouper(6, f, fillvalue='')]


[[['ID:', '00000001'], ['SENT:', 'to do something'], ['to', '01573831'], ['do', '02017283'], ['something', '03517283']], 
 [['ID:', '00000002'], ['SENT:', 'just an example'], ['just', '06482823'], ['an', '01298744'], ['example', '01724894']]]
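The session above is Python 2 (`izip_longest` and the `print` statement). A Python 3 sketch of the same approach might look like this; it uses `itertools.zip_longest` and, so that it is self-contained, reads from an in-memory list of lines rather than `text.txt`. Like the original, it assumes every record spans exactly six lines (five lines of content plus one blank separator).

```python
from itertools import zip_longest


def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length chunks or blocks."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Stand-in for the lines of text.txt, each ending in a newline as a file would give.
lines = [
    "ID: 00000001\n", "SENT: to do something\n",
    "to\t01573831\n", "do\t02017283\n", "something\t03517283\n",
    "\n",
    "ID: 00000002\n", "SENT: just an example\n",
    "just\t06482823\n", "an\t01298744\n", "example\t01724894\n",
]

# Split each non-blank line at the first run of whitespace only.
result = [[x.rstrip().split(None, 1) for x in g if x.rstrip()]
          for g in grouper(6, lines, fillvalue="")]
print(result)
```

If records can have a varying number of word lines, fixed-size grouping no longer applies and splitting on blank lines would be the safer choice.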

Upvotes: 0

Eric

Reputation: 97691

A more Pythonic version of Thorsten's answer:

from collections import namedtuple

class Element(namedtuple("ElementBase", "id sent words")):
    @classmethod
    def parse(cls, source):
        lines = source.split("\n")
        return cls(
            id=lines[0][4:],
            sent=lines[1][6:],
            words=dict(
                line.split("\t") for line in lines[2:]
            )
        )

text = """ID: 00000001
SENT: to do something
to\t01573831
do\t02017283
something\t03517283

ID: 00000002
SENT: just an example
just\t06482823
an\t01298744
example\t01724894"""

elements = [Element.parse(part) for part in text.split("\n\n")]

for el in elements:
    print el
    print el.id
    print el.sent
    print el.words
    print

Upvotes: 2

Thorsten Kranz

Reputation: 12765

I'd regard every part of the topmost split as an "object". Thus, I'd create a class with properties corresponding to each part.

class Element(object):
    def __init__(self, source):
        lines = source.split("\n")
        self._id = lines[0][4:]
        self._sent = lines[1][6:]
        self._words = {}
        for line in lines[2:]:
            word, id_ = line.split("\t")
            self._words[word] = id_

    @property
    def ID(self):
        return self._id

    @property
    def sent(self):
        return self._sent

    @property
    def words(self):
        return self._words

    def __str__(self):
        return "Element %s, containing %i words" % (self._id, len(self._words))

text = """ID: 00000001
SENT: to do something
to\t01573831
do\t02017283
something\t03517283

ID: 00000002
SENT: just an example
just\t06482823
an\t01298744
example\t01724894"""

elements = [Element(part) for part in text.split("\n\n")]

for el in elements:
    print el
    print el.ID
    print el.sent
    print el.words
    print

In the main code (one line, the list comprehension), the text is only split at each double newline. All the remaining logic is deferred to the __init__ method, which keeps it local to the class.

Using a class also gives you the benefit of __str__, allowing you control over how your objects are printed.

You could also consider rewriting the last four lines of __init__ to:

self._words = dict([line.split("\t") for line in lines[2:]])

but I wrote a plain loop as it seemed to be easier to understand.

Upvotes: 0
