Reputation: 303
This is a sample of the raw text I'm reading:
ID: 00000001
SENT: to do something
to 01573831
do 02017283
something 03517283

ID: 00000002
SENT: just an example
just 06482823
an 01298744
example 01724894
Right now I'm trying to split it into a list of lists of lists.
Topmost level list: By the ID so 2 elements here (done)
Next level: Within each ID, split by newlines
Last level: Within each line, split the word from its ID. For the lines beginning with ID or SENT, it doesn't matter whether they are split or not. The word and its ID are separated by a tab (\t).
Current code:
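f = open("text.txt", "r")
raw = list(f)                    # one string per line, each still ending in "\n"
f.close()
text = " ".join(raw)
wordlist = text.split("\n \n ")  # split by ID (a blank line becomes "\n \n " after the join)
toplist = wordlist[:2]           # just take 2 IDs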
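The three levels described above can be sketched without the intermediate " ".join() step. This is only a minimal sketch written for Python 3: it assumes that ID blocks are separated by one blank line and that each word is separated from its ID by a single tab, as stated in the question. The name parse_blocks and the inline SAMPLE string are made up for illustration; SAMPLE stands in for the contents of text.txt.

```python
# Sample data in the question's format (tab between word and ID is assumed).
SAMPLE = (
    "ID: 00000001\n"
    "SENT: to do something\n"
    "to\t01573831\n"
    "do\t02017283\n"
    "something\t03517283\n"
    "\n"
    "ID: 00000002\n"
    "SENT: just an example\n"
    "just\t06482823\n"
    "an\t01298744\n"
    "example\t01724894\n"
)


def parse_blocks(text):
    """Return a list (per ID) of lists (per line) of lists (per tab-separated field)."""
    blocks = []
    for block in text.strip().split("\n\n"):   # top level: one chunk per ID
        lines = block.split("\n")              # next level: one entry per line
        # last level: split on the tab; ID and SENT lines stay as one field
        blocks.append([line.split("\t") for line in lines])
    return blocks


nested = parse_blocks(SAMPLE)
print(nested[0][2])
```

Because the ID and SENT lines contain no tab, they come back as single-element lists, which matches the "doesn't matter if they are split" requirement.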
Edit: I was going to cross-reference the words against another text file to add their word classes, which is why I asked for a list of lists of lists.
Steps:
1) Use .append() to add on word classes for each word
2) Use "\t".join() to connect a line together
3) Use "\n".join() to connect different lines in an ID
4) Use "\n\n".join() to connect all the IDs together into a string
Output:
ID: 00000001
SENT: to do something
to 01573831 prep
do 02017283 verb
something 03517283 noun

ID: 00000002
SENT: just an example
just 06482823 adverb
an 01298744 ind-art
example 01724894 noun
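Steps 1 to 4 could look roughly like the sketch below (Python 3). It assumes the nested list already exists in the shape described earlier, with ID and SENT lines left unsplit. WORD_CLASSES is a hypothetical dictionary standing in for the second text file, annotate is a made-up name, and the "unknown" fallback for words missing from the lookup is an assumption, not something the question specifies.

```python
# Hypothetical lookup table standing in for the second text file of word classes.
WORD_CLASSES = {
    "to": "prep", "do": "verb", "something": "noun",
    "just": "adverb", "an": "ind-art", "example": "noun",
}

# Nested structure as described in the question: IDs -> lines -> fields.
nested = [
    [["ID: 00000001"], ["SENT: to do something"],
     ["to", "01573831"], ["do", "02017283"], ["something", "03517283"]],
    [["ID: 00000002"], ["SENT: just an example"],
     ["just", "06482823"], ["an", "01298744"], ["example", "01724894"]],
]


def annotate(blocks, classes):
    """Append a word class to every [word, id] line, then rebuild the text."""
    for block in blocks:
        for fields in block:
            # ID and SENT lines were not split, so they hold a single field.
            if len(fields) == 2:
                # step 1: append the word class ("unknown" is an assumed default)
                fields.append(classes.get(fields[0], "unknown"))
    return "\n\n".join(                                    # step 4: join the IDs
        "\n".join("\t".join(fields) for fields in block)   # steps 2 and 3
        for block in blocks
    )


result = annotate(nested, WORD_CLASSES)
print(result)
```

Note that annotate modifies the nested lists in place before joining them, so calling it twice on the same data would append the word class twice.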
Upvotes: 0
Views: 379
Reputation: 983
Would this work for you?
Top - level (which you have done)
def get_parent(text, parent):
    """recursively walk through text, looking for 'ID' tag"""
    # find open_ID and close_ID
    open_ID = text.find('ID')
    close_ID = text.find('ID', open_ID + 1)
    # if there is another instance of 'ID', recursively walk again
    if close_ID != -1:
        parent.append(text[open_ID : close_ID])
        return get_parent(text[close_ID:], parent)
    # base-case
    else:
        parent.append(text[open_ID:])
        return
Second - level: split by newlines:
def child_split(parent):
    index = 0
    while index < len(parent):
        parent[index] = parent[index].split('\n')
        index += 1
Third - level: split the 'ID' and 'SENT' fields
def split_field(parent, index):
    if index < len(parent):
        child = 0
        while child < len(parent[index]):
            if ':' in parent[index][child]:
                parent[index][child] = parent[index][child].split(':')
            else:
                parent[index][child] = parent[index][child].split()
            child += 1
        return split_field(parent, index + 1)
    else:
        return
Running it all together:
def main(text):
    parent = []
    get_parent(text, parent)
    child_split(parent)
    split_field(parent, 0)
    return parent  # hand the nested result back to the caller
The result is quite nested; perhaps it can be cleaned up somewhat? Or perhaps the split_field() function could return a dictionary?
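One way to read the dictionary idea is sketched below (Python 3). It is only an illustration, not a drop-in replacement for split_field(): block_to_dict is a made-up name, it works on one ID block at a time, and it assumes the ID and SENT lines start with exactly those labels.

```python
def block_to_dict(block):
    """Turn one ID block (a multi-line string) into a dictionary."""
    record = {"words": []}
    for line in block.strip().split("\n"):
        if line.startswith("ID:") or line.startswith("SENT:"):
            key, value = line.split(":", 1)   # split only on the first colon
            record[key] = value.strip()
        else:
            record["words"].append(line.split())  # [word, id]
    return record


block = "ID: 00000001\nSENT: to do something\nto\t01573831\ndo\t02017283"
record = block_to_dict(block)
print(record["ID"], record["SENT"], record["words"])
```

Keeping the words in a list rather than a nested dictionary preserves their order and allows the same word to appear twice in a sentence.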
Upvotes: 0
Reputation: 133764
I'm not sure exactly what output you need, but you can adjust this to fit your needs (this uses the itertools grouper recipe):
>>> from itertools import izip_longest
>>> def grouper(n, iterable, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return izip_longest(fillvalue=fillvalue, *args)

>>> with open('text.txt') as f:
        print [[x.rstrip().split(None, 1) for x in g if x.rstrip()]
               for g in grouper(6, f, fillvalue='')]

[[['ID:', '00000001'], ['SENT:', 'to do something'], ['to', '01573831'], ['do', '02017283'], ['something', '03517283']],
[['ID:', '00000002'], ['SENT:', 'just an example'], ['just', '06482823'], ['an', '01298744'], ['example', '01724894']]]
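The session above is Python 2. On Python 3 the same idea might look like the sketch below, where izip_longest is spelled zip_longest and print is a function. The io.StringIO object is only a stand-in for open('text.txt') so that the example runs on its own; the sample content and the tab separators are assumptions taken from the question.

```python
import io
from itertools import zip_longest


def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length chunks: grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# io.StringIO stands in for open('text.txt') so the sketch is self-contained.
sample = io.StringIO(
    "ID: 00000001\nSENT: to do something\n"
    "to\t01573831\ndo\t02017283\nsomething\t03517283\n"
    "\n"
    "ID: 00000002\nSENT: just an example\n"
    "just\t06482823\nan\t01298744\nexample\t01724894\n"
)

# Six lines per group: ID, SENT, three words, and the blank separator line.
groups = [
    [line.rstrip().split(None, 1) for line in group if line.rstrip()]
    for group in grouper(6, sample, fillvalue="")
]
print(groups)
```

Grouping by a fixed count of 6 only works while every ID block has exactly three words; for sentences of varying length, splitting on the blank line is the safer choice.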
Upvotes: 0
Reputation: 97691
A more pythonic version of Thorsten's answer:
from collections import namedtuple

class Element(namedtuple("ElementBase", "id sent words")):
    @classmethod
    def parse(cls, source):
        lines = source.split("\n")
        return cls(
            id=lines[0][4:],
            sent=lines[1][6:],
            words=dict(
                line.split("\t") for line in lines[2:]
            )
        )

text = """ID: 00000001
SENT: to do something
to\t01573831
do\t02017283
something\t03517283

ID: 00000002
SENT: just an example
just\t06482823
an\t01298744
example\t01724894"""

elements = [Element.parse(part) for part in text.split("\n\n")]
for el in elements:
    print el
    print el.id
    print el.sent
    print el.words
    print
Upvotes: 2
Reputation: 12765
I'd regard every part of the topmost split as an "object". Thus, I'd create a class with properties corresponding to each part.
class Element(object):
    def __init__(self, source):
        lines = source.split("\n")
        self._id = lines[0][4:]
        self._sent = lines[1][6:]
        self._words = {}
        for line in lines[2:]:
            word, id_ = line.split("\t")
            self._words[word] = id_

    @property
    def ID(self):
        return self._id

    @property
    def sent(self):
        return self._sent

    @property
    def words(self):
        return self._words

    def __str__(self):
        return "Element %s, containing %i words" % (self._id, len(self._words))

text = """ID: 00000001
SENT: to do something
to\t01573831
do\t02017283
something\t03517283

ID: 00000002
SENT: just an example
just\t06482823
an\t01298744
example\t01724894"""

elements = [Element(part) for part in text.split("\n\n")]
for el in elements:
    print el
    print el.ID
    print el.sent
    print el.words
    print
In the main code (one line, the list comprehension) the text is only split at each double newline. All remaining logic is deferred to the __init__ method, which keeps it local to the class.
Using a class also gives you the benefit of __str__, which lets you control how your objects are printed.
You could also consider rewriting the last four lines of __init__ to:
self._words = dict([line.split("\t") for line in lines[2:]])
but I wrote a plain loop as it seemed to be easier to understand.
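To check that the two spellings build the same dictionary, here is a small standalone sketch (Python 3). The lines list is sample data made up from the question's first block, and the dict comprehension is a third variant added purely for comparison.

```python
lines = ["to\t01573831", "do\t02017283", "something\t03517283"]

# Explicit loop, as in __init__ above.
words_loop = {}
for line in lines:
    word, id_ = line.split("\t")
    words_loop[word] = id_

# dict() over the [word, id] pairs, as in the one-line alternative.
words_dict = dict([line.split("\t") for line in lines])

# A dict comprehension is a third way to spell the same thing.
words_comp = {word: id_ for word, id_ in (line.split("\t") for line in lines)}

print(words_loop == words_dict == words_comp)
```

All three raise a ValueError if a line does not split into exactly two fields, so none of them silently accepts malformed input.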
Upvotes: 0