Daichi

Reputation: 309

How do I segment my text file into paragraphs in Python?

I have a text file:

000140.psd

1) You've heard of slow food. 

nsubj(heard-3, You-1)
aux(heard-3, 've-2)
root(ROOT-0, heard-3)
case(food-6, of-4)
amod(food-6, slow-5) s1
nmod:of(heard-3, food-6) t1

2) This is slow denim. 

nsubj(denim-4, This-1)
cop(denim-4, is-2)
amod(denim-4, slow-3) s1
root(ROOT-0, denim-4) t1

I want to run a loop that looks through all the lines containing an s1 (or s2, s3, etc.) in each individual paragraph, and I want to create two lists for each paragraph. The first list will contain the lines with 's#' in them and the second will contain all of the lines. This is so I can create 'rules' to determine which lines should be labelled 't#'. In this example t1 is given, but I want to determine t# in cases where it isn't already marked. Is there a way to make two distinct lists for each paragraph so that I can automate the comparison?

I've tried:

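import re

source = []  # collects every matching line in the file
lexxe = open('000140.ant')
for line in lexxe:
    line = line.rstrip()
    # keep lines that end in s followed by one digit
    if re.search('s[0-9]$', line):
        source.append(line)
print(source)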

but this only gives me one list of ALL the lines that contain s plus a number, not a separate list for each paragraph.

Upvotes: 1

Views: 875

Answers (1)

tituszban

Reputation: 5152

You need to split your text into paragraphs first, then do the processing you want:

Read your file into a string:

with open('000140.ant') as f:
    lexxe = f.read()

Then split it into paragraphs using a regex:

paragraphs = re.sub(r'(\n\d\))', r'|\1', lexxe).split('|')

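This splits on every newline that is followed by a single digit and a closing parenthesis. Because \d matches exactly one digit, paragraphs numbered 10 and above will not be split; use \d+ if your file has them. The | character is a workaround: re.sub inserts it in front of each match and split then cuts on it, so the start of each paragraph isn't consumed. This will not work if | appears anywhere in your text, but you can pick a different character that does not.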
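As an alternative to the | placeholder, a lookahead lets re.split cut at the same places without inserting a marker. This is only a minimal sketch, not the method used above: the sample text is a shortened stand-in for the file in the question, and \d+ is used instead of \d on the assumption that paragraph numbers may have more than one digit.

```python
import re

# Shortened stand-in for the file contents (an assumption for illustration).
text = (
    "\n000140.psd\n"
    "\n1) You've heard of slow food. \n"
    "\nnsubj(heard-3, You-1)\n"
    "amod(food-6, slow-5) s1\n"
    "nmod:of(heard-3, food-6) t1\n"
    "\n2) This is slow denim. \n"
    "\namod(denim-4, slow-3) s1\n"
    "root(ROOT-0, denim-4) t1\n"
)

# Split on a newline that is followed by "<digits>)". The lookahead (?=...)
# is not consumed, so each paragraph keeps its "1)" / "2)" prefix.
paragraphs = re.split(r'\n(?=\d+\))', text)

for p in paragraphs:
    print(repr(p))
```

Unlike the | version, the newline in front of each paragraph number is consumed by the split, so the paragraphs here start directly with the number.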

Then you can find the s# lines in each paragraph using a list comprehension:

source = [[l.rstrip() for l in p.split('\n') if re.search(r's\d$', l.rstrip())] for p in paragraphs]

So you'll end up with:

> paragraphs
['\n000140.psd\n', "\n1) You've heard of slow food. \n\nnsubj(heard-3, You-1)\naux(heard-3, 've-2)\nroot(ROOT-0, heard-3)\ncase(food-6, of-4)\namod(food-6, slow-5) s1\nnmod:of(heard-3, food-6) t1\n", '\n2) This is slow denim. \n\nnsubj(denim-4, This-1)\ncop(denim-4, is-2)\namod(denim-4, slow-3) s1\nroot(ROOT-0, denim-4) t1\n']

which you can split to lines with:

paragraph_lines = [p.split('\n') for p in paragraphs]

Giving you:

> paragraph_lines
[['', '000140.psd', ''], ['', "1) You've heard of slow food. ", '', 'nsubj(heard-3, You-1)', "aux(heard-3, 've-2)", 'root(ROOT-0, heard-3)', 'case(food-6, of-4)', 'amod(food-6, slow-5) s1', 'nmod:of(heard-3, food-6) t1', ''], ['', '2) This is slow denim. ', '', 'nsubj(denim-4, This-1)', 'cop(denim-4, is-2)', 'amod(denim-4, slow-3) s1', 'root(ROOT-0, denim-4) t1', '']]

And source will be:

> source
[[], ['amod(food-6, slow-5) s1'], ['amod(denim-4, slow-3) s1']]

Keep in mind that the header (000140.psd) ends up as the first paragraph. You can drop it with paragraphs = paragraphs[1:].
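Putting the steps together, here is a rough sketch that builds the two lists per paragraph the question asks for. The function name paragraph_lists and the sample text are hypothetical. It splits with a lookahead on a newline followed by digits and a closing parenthesis, skips any chunk that does not start with a paragraph number (such as the header), and assumes the s# label is a separate token at the end of the line, possibly with more than one digit.

```python
import re

def paragraph_lists(text):
    """Return one (all_lines, s_lines) pair per numbered paragraph."""
    pairs = []
    for chunk in re.split(r'\n(?=\d+\))', text):
        # Skip anything that does not start with "<digits>)", e.g. the header.
        if not re.match(r'\d+\)', chunk):
            continue
        # Drop blank lines and strip trailing whitespace from the rest.
        all_lines = [line.rstrip() for line in chunk.split('\n') if line.strip()]
        # Keep lines whose last token is s followed by digits (s1, s2, ...).
        s_lines = [line for line in all_lines if re.search(r'\bs\d+$', line)]
        pairs.append((all_lines, s_lines))
    return pairs

# Shortened stand-in for the file contents (an assumption for illustration).
sample = (
    "\n000140.psd\n"
    "\n1) You've heard of slow food. \n"
    "\nnsubj(heard-3, You-1)\n"
    "amod(food-6, slow-5) s1\n"
    "nmod:of(heard-3, food-6) t1\n"
    "\n2) This is slow denim. \n"
    "\namod(denim-4, slow-3) s1\n"
    "root(ROOT-0, denim-4) t1\n"
)

pairs = paragraph_lists(sample)
for all_lines, s_lines in pairs:
    print(len(all_lines), s_lines)
```

Each pair keeps the full paragraph next to its s# lines, so a rule for assigning t# can compare the two lists directly.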

Upvotes: 1
