bruno cuconato

Reputation: 164

How to turn text into a nested list

I'm trying to turn a text input into a nested list that preserves its structure. For the moment I have a function that takes a text and a desired "depth" and outputs a nested list of that depth, breaking the text at every newline, sentence, or word.

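def text_split(text, depth):
    # depth indexes this list: 2 = lines, 1 = sentences, 0 = words
    depth_list = [' ', '.', '\n']
    if isinstance(text, str):
        text = text.strip('. ')
        text = text.split(depth_list[depth])
    if depth >= 0:
        depth -= 1
        # note: this also runs at depth 0, so each word is split once more
        # (with depth -1) and ends up wrapped in its own list
        for ix, item in enumerate(text):
            item = item.strip('. ')
            text[ix] = text_split(item, depth)
    return text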

This takes text such as

text1 = """acabei de ler um livro. um diário.
mas a liberdade sempre chamou fountaine mais forte.
a cada viagem fountaine ía mais longe. aprendeu a andar de bicicleta e viajou o sul da frança.

esse é o tipo de pergunta feita na última edição do prêmio Loebner, em que participantes precisam responder à algumas questões feitas pelo júri.

o que tem de especial nessa competição é que ela não é para humanos, mas sim para robôs. o prêmio Loebner é uma implementação do teste de Turing.

"""

into

[   [[['acabei'], ['de'], ['ler'], ['um'], ['livro']], [['um'], ['diário']]],
[   [   ['mas'],
        ['a'],
        ['liberdade'],
        ['sempre'],
        ['chamou'],
        ['fountaine'],
        ['mais'],
        ['forte']]],
[   [   ['a'],
        ['cada'],
        ['viagem'],
        ['fountaine'],
        ['ía'],
        ['mais'],
        ['longe']],
    [   ['aprendeu'],
        ['a'],
        ['andar'],
        ['de'],
        ['bicicleta'],
        ['e'],
        ['viajou'],
        ['o'],
        ['sul'],
        ['da'],
        ['frança']]],
[[['']]], ... ]]]]

Now, this is probably not the best or most elegant way of doing this, and it has some problems, such as the [[['']]] that appears wherever splitting on \n leaves an empty string (something that could perhaps be avoided by using .splitlines(), but I could not find a nice way of calling this method in a recursive function).
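
To illustrate where the empty entries come from, here is a minimal sketch. The `sample` string is made up for the illustration and is not the text above. It suggests that .splitlines() only drops the trailing empty string, so blank lines in the middle would still need to be filtered out:

```python
# Hypothetical sample with a blank line in the middle and a trailing newline.
sample = "one. two.\n\nthree.\n"

print(sample.split('\n'))    # ['one. two.', '', 'three.', '']
print(sample.splitlines())   # ['one. two.', '', 'three.']

# Filtering out blank lines handles both cases.
lines = [line for line in sample.splitlines() if line.strip()]
print(lines)                 # ['one. two.', 'three.']
```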

What is a better way of doing this? Should I be using nested lists at all? (I'm planning on iterating through this afterwards.) Thanks for the advice!

Upvotes: 1

Views: 616

Answers (2)

AChampion

Reputation: 30258

You can use nested list comprehensions with the same splitting criteria, filtering out empty strings at each level:

>>> [[s.split() for s in line.split('.') if s] for line in text1.split('\n') if line]
[[['acabei', 'de', 'ler', 'um', 'livro'], ['um', 'diário']],
 [['mas', 'a', 'liberdade', 'sempre', 'chamou', 'fountaine', 'mais', 'forte']],
 [['a', 'cada', 'viagem', 'fountaine', 'ía', 'mais', 'longe'],
  ['aprendeu', 'a', 'andar', 'de', 'bicicleta', 'e', 'viajou', 'o', 'sul', 'da', 'frança']],
 ...
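
Since the question mentions iterating over the result afterwards, the comprehension could be wrapped in a function along these lines. This is only a sketch: the name `nest` and the `sample` string are made up for the illustration, and it uses .splitlines() plus .strip() checks so that whitespace-only lines and sentences are skipped too.

```python
def nest(text):
    """Split text into lines -> sentences -> words,
    skipping blank lines and blank sentences."""
    return [
        [s.split() for s in line.split('.') if s.strip()]
        for line in text.splitlines()
        if line.strip()
    ]

# Hypothetical sample input, shorter than text1 from the question.
sample = "acabei de ler um livro. um diário.\n\nmas a liberdade.\n"
nested = nest(sample)

# Walk the three levels: line index, sentence index, list of words.
for line_no, sentences in enumerate(nested):
    for sent_no, words in enumerate(sentences):
        print(line_no, sent_no, words)
```

Each level is an ordinary list, so plain nested for loops (or enumerate, as here) are enough to walk it.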

Upvotes: 1

meyer9

Reputation: 1140

Here's the best I could come up with to fit your requirements:

text = []
for line in text1.split('\n'):
    sentences = []
    for sentence in line.split('.'):
        words = []
        for word in sentence.split(' '):
            word = word.strip()
            if word:  # make sure we are adding something
                words.append(word)
        if words:
            sentences.append(words)
    if sentences:
        text.append(sentences)

Using this, we get a well-defined structure for the list and can be sure that it contains no blank strings or empty lists. Also, recursion is not a good fit here because the text has a clear, fixed structure (lines, sentences, words), so the nesting never goes deeper than 3 levels.

Also, if you want a recursive version, you should say so in your question and clarify the requirements.
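
For what it's worth, if a recursive version does turn out to be needed, one possible shape is sketched below. It is not the function from the question: the `separators` parameter is an assumption introduced here to replace the numeric depth, and empty pieces are dropped at every level.

```python
def text_split(text, separators=('\n', '.')):
    """Split text on each separator in turn, outermost first.
    The innermost level splits on whitespace into words."""
    if not separators:
        return text.split()
    head, rest = separators[0], separators[1:]
    parts = (text_split(part, rest) for part in text.split(head))
    # Drop pieces that came out empty (blank lines, trailing '.').
    return [p for p in parts if p]

# Hypothetical sample input.
sample = "a b. c.\n\nd.\n"
print(text_split(sample))           # [[['a', 'b'], ['c']], [['d']]]
print(text_split(sample, ('.',)))   # [['a', 'b'], ['c'], ['d']]
```

Passing a shorter tuple of separators gives a shallower list, which plays the role of the "depth" argument in the original function.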

Upvotes: 1
