kulukrok

Reputation: 877

Merging Words into a Line

I am currently using Python v2.6 and trying to merge words into a line. My code is supposed to read data from a text file that holds two columns of string data (labeled "row_1_data" and "row_2_data" below). It takes the second value from every line; these values are the words of sentences, and the sentences are separated by delimiter strings, like this:

Inside the .txt:

"delimiter_string"
"row_1_data" "row_2_data"  
"row_1_data" "row_2_data"  
"row_1_data" "row_2_data"  
"row_1_data" "row_2_data"  
"row_1_data" "row_2_data"  
"delimiter_string" 
"row_1_data" "row_2_data"  
"row_1_data" "row_2_data"  
...

The "row_2_data" values are later joined into a sentence. Sorry for the long introduction, by the way.

Here is my code:

import sys
import re

newLine = ''

for line in sys.stdin:

    word = line.split(' ')[1]

    if word == '<S>+BSTag': 
        continue

    elif word == '</S>+ESTag':
        print newLine
        newLine = ''
        continue

    else:
        w = re.sub('\[.*?]', '', word)

        if newLine == '':
            newLine += w
        else:
            newLine += ' ' + w

"BSTag" is the tag for "Sentence Begins" and "ESTag" is the tag for "Sentence Ends"; these are the so-called delimiters. "re.sub" serves a special purpose (it removes the bracketed part of each word), and it works as far as I have checked.

The problem is that when I run this Python script from the Linux command line with the following command: $ cat file.txt | script.py | less, I see no output at all, just a blank screen.

For those who are not familiar with Linux: I guess the problem has nothing to do with running the script from the terminal, so you can ignore that part. Simply put, the code does not work as intended and I cannot find a single mistake.

Any help will be appreciated, and thanks for reading the long post :)


OK, the problem is solved. It was actually a corpus error rather than a coding one: a very odd entry in the text file was causing the problem, and removing it fixed it. If you need similar text processing, you can use either approach: mine or the one presented by "snurre".
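
For anyone chasing a similar corpus problem, a minimal sketch like the one below could flag suspicious lines before the main script runs. The post does not say what the odd entry looked like, so the check used here (fewer than two whitespace-separated fields) is only an assumption, and find_odd_lines and the sample data are hypothetical names made up for illustration.

```python
def find_odd_lines(lines, min_fields=2):
    """Return (line_number, text) pairs for lines with fewer than min_fields fields."""
    odd = []
    for number, line in enumerate(lines, 1):
        fields = line.split()
        # Blank lines and one-column lines have no second field to read
        if len(fields) < min_fields:
            odd.append((number, line.rstrip('\n')))
    return odd


# Hypothetical sample data, made up for illustration
sample = [
    "w1 <S>+BSTag\n",
    "w2 kitap[Noun]\n",
    "broken\n",
    "\n",
    "w3 </S>+ESTag\n",
]

for number, text in find_odd_lines(sample):
    print("line %d: %r" % (number, text))
```

On a real corpus, passing an open file object (or sys.stdin) instead of the sample list should work the same way, since the function only iterates over lines.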

Cheers.

Upvotes: 3

Views: 1387

Answers (2)

snurre

Reputation: 3105

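import sys

def foo(lines):
    output = []
    for line in lines:
        words = line.split()
        if not words:
            # Blank lines carry no word, so there is nothing to collect
            continue
        if len(words) < 2:
            word = words[0]
        else:
            word = words[1]
        if word == '</S>+ESTag':
            # End of sentence: emit the collected words as one line
            yield ' '.join(output)
            output = []
        elif word != '<S>+BSTag':
            # Use the selected word so one-column lines do not raise IndexError
            output.append(word)

for sentence in foo(sys.stdin):
    print sentence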

Your regex is a little funky. From what I can tell, it's replacing anything between (and including) a pair of [ and ] with '', so it ends up printing empty strings.
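
To see what that pattern actually removes, here is a small sketch. The tokens are hypothetical (the real corpus format is not shown in the question), and strip_brackets is a name made up for illustration. Under these assumptions, the result is empty only when the whole word sits inside brackets; any text outside the brackets survives.

```python
import re

# Same non-greedy pattern as in the question, written as a raw string
BRACKETED = re.compile(r'\[.*?]')


def strip_brackets(word):
    """Remove every [...] group from word and return what is left."""
    return BRACKETED.sub('', word)


# Hypothetical tokens, made up for illustration
print(repr(strip_brackets('kitap[Noun]+[A3sg]')))  # 'kitap+'
print(repr(strip_brackets('[Noun]')))              # ''
print(repr(strip_brackets('kitap')))               # 'kitap'
```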

Upvotes: 1

GP89

Reputation: 6730

I think the problem is that the script isn't being executed (unless you just left the shebang out of the code you posted).

Try this:

cat file.txt | python script.py | less
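
If you want to check whether a script can be launched without naming the interpreter, a rough sketch like this may help. It only tests two conditions (a leading "#!" and the execute permission); has_shebang and can_run_directly are hypothetical helper names, and the check does not cover whether the shell can find the script on its PATH.

```python
import os


def has_shebang(path):
    """Return True if the file at path starts with '#!'."""
    with open(path, 'rb') as handle:
        return handle.read(2) == b'#!'


def can_run_directly(path):
    """Return True if path has a shebang line and its execute permission is set."""
    return has_shebang(path) and os.access(path, os.X_OK)
```

If can_run_directly('script.py') comes back False, calling the interpreter explicitly, as in the command above, sidesteps the issue.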

Upvotes: 0
