Reputation: 877
I am currently using Python v2.6 and trying to merge words into a line. My code supposed to read data from a text file, in which I have two rows of data both of which are strings. Then, it takes the second row data every time, which are the words of sentences, those are separated by delimiter strings, such that:
Inside the .txt:
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
...
Those "row_2_data" will add-up to a sentence later. Sorry for the long introduction btw.
Here is my code:
import sys
import re
newLine = ''
for line in sys.stdin:
word = line.split(' ')[1]
if word == '<S>+BSTag':
continue
elif word == '</S>+ESTag':
print newLine
newLine = ''
continue
else:
w = re.sub('\[.*?]', '', word)
if newLine == '':
newLine += w
else:
newLine += ' ' + w
"BSTag" is the tag for "Sentence Begins" and "ESTag" is for "Sentence Ends": the so called "delimiters". "re.sub" is used for a special purpose and it works as far as I checked.
The problem is that, when I execute this python script from the command line in linux with the following command: $ cat file.txt | script.py | less, I can not see any output, but just a blank file.
For those who are not familiar with linux, I guess the problem has nothing to do with terminal execution, thus you can neglect that part. Simply, the code does not work as intended and I can not find a single mistake.
Any help will be appreciated, and thanks for reading the long post :)
Ok, the problem is solved, which was actually a corpus error instead of a coding one. A very odd entry was detected in the text file, which was causing problems. Removing it solved it. You can use both of these approaches: mine and the one presented by "snurre" if you want a similar text processing.
Cheers.
Upvotes: 3
Views: 1387
Reputation: 3105
def foo(lines):
output = []
for line in lines:
words = line.split()
if len(words) < 2:
word = words[0]
else:
word = words[1]
if word == '</S>+ESTag':
yield ' '.join(output)
output = []
elif word != '<S>+BSTag':
output.append(words[1])
for sentence in foo(sys.stdin):
print sentence
Your regex is a little funky. From what I can tell, it's replacing anything between (and including) a pair of [
and ]
with ''
, so it ends up printing empty strings.
Upvotes: 1