Reputation: 300
I am working with recursive neural networks and need to process my input text file (containing trees) to extract the words. The input file looks like this:
(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))
(4 (4 (4 (2 The) (4 (3 gorgeously) (3 (2 elaborate) (2 continuation)))) (2 (2 (2 of) (2 ``)) (2 (2 The) (2 (2 (2 Lord) (2 (2 of) (2 (2 the) (2 Rings)))) (2 (2 '') (2 trilogy)))))) (2 (3 (2 (2 is) (2 (2 so) (2 huge))) (2 (2 that) (3 (2 (2 (2 a) (2 column)) (2 (2 of) (2 words))) (2 (2 (2 (2 can) (1 not)) (3 adequately)) (2 (2 describe) (2 (3 (2 (2 co-writer/director) (2 (2 Peter) (3 (2 Jackson) (2 's)))) (3 (2 expanded) (2 vision))) (2 (2 of) (2 (2 (2 J.R.R.) (2 (2 Tolkien) (2 's))) (2 Middle-earth))))))))) (2 .)))
As output, I want the list of words in a new text file, like this:
The
Rock
is
destined
...
(Ignore the spaces in between lines.)
I tried doing it in Python but could not arrive at a solution. I also read that awk can be used for text processing, but I was unable to produce any working code. Any help is appreciated.
Upvotes: 2
Views: 12558
Reputation: 12938
You can use regex!
import re
my_string = "..."  # paste the tree string from the question here
# match "(<digit> " followed by an optional apostrophe and one or more word characters
pattern = r"\(\d\s+('?\w+)"
results = re.findall(pattern, my_string)
print(results)
# ['The',
# 'Rock',
# 'is',
# 'destined',
# 'to',
# 'be',
# 'the',
# ...
Note that re.findall returns a list of matches, so if you want to print them all out in a single sentence, you can use:
' '.join(results)
or whatever other character you want to separate words with instead of a blank space.
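Since the question asks for one word per line in a new file, a minimal sketch along the same lines might look like this (tree_file.txt and word_list.txt are assumed file names):
import re

pattern = r"\(\d\s+('?\w+)"

# read the tree file, pull out the words, and write one word per line
with open('tree_file.txt') as f_in, open('word_list.txt', 'w') as f_out:
    f_out.write('\n'.join(re.findall(pattern, f_in.read())))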
Breaking the regular expression pattern down, we have:
pattern = r"""
\(   # match an opening parenthesis
\d   # match a number. If the numbers can be > 9, use \d+
\s+  # match one or more whitespace characters
(    # begin capturing group (only return what is inside these parentheses)
'?   # match zero or one apostrophe (so we don't miss possessives)
\w+  # match one or more word characters
)    # end capturing group
"""
Upvotes: 4
Reputation: 11602
For the record, we can choose what to throw away rather than what to keep. For example, we can split on parens, spaces and numbers. The remainder consists of words and punctuation. This might be handy for non-Latin text and special characters.
import re
# split on parens, digits and whitespace; filter out the empty strings left over
spl = re.compile(r"\(|\)|[0-9]|\s")
words = list(filter(None, spl.split(string_to_split)))
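Applied to the question's files, a minimal sketch might look like this (again, tree_file.txt and word_list.txt are assumed names); note that punctuation tokens (quotes, commas, periods) are kept as well:
import re

spl = re.compile(r"\(|\)|[0-9]|\s")

with open('tree_file.txt') as f_in, open('word_list.txt', 'w') as f_out:
    # drop the empty strings the split produces and write one token per line
    f_out.write('\n'.join(filter(None, spl.split(f_in.read()))))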
Upvotes: 3
Reputation: 71461
You can use re.findall:
import re
with open('tree_file.txt') as f, open('word_list.txt', 'a') as f1:
    f1.write('\n'.join(set(re.findall(r"[a-zA-Z\-\.'/]+", f.read()))))
When running the code above on the text, the output is:
make
not
gorgeously
the
Conan
than
so
huge
and
co-writer/director
Peter
st
is
can
Schwarzenegger
expanded
even
trilogy
Middle-earth
Segal
continuation
column
vision
's
he
''
Damme
adequately
that
greater
Steven
Rock
Jackson
Rings
a
Tolkien
Van
be
words
going
to
new
Jean-Claud
or
elaborate
of
splash
Lord
The
Arnold
describe
destined
J.R.R.
Century
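Note that set() removes duplicates and does not preserve the order in which the words appear. If you want every occurrence in the original left-to-right order, as in the expected output in the question, drop the set() call:
import re
with open('tree_file.txt') as f, open('word_list.txt', 'w') as f1:
    f1.write('\n'.join(re.findall(r"[a-zA-Z\-\.'/]+", f.read())))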
Upvotes: 3
Reputation: 872
You can use re.compile:
import re

def getWords(text):
    # [A-Za-z]+ matches runs of letters, i.e. whole words rather than single characters
    return re.compile('[A-Za-z]+').findall(text)

with open('input_file.txt') as f_in:
    with open('output_file.txt', 'a') as f_out:
        f_out.write('\n'.join(getWords(f_in.read())))
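The + quantifier is what makes this return whole words; without it, findall would return individual letters:
import re
sample = "(2 (2 The) (2 Rock))"
print(re.compile('[A-Za-z]').findall(sample))   # ['T', 'h', 'e', 'R', 'o', 'c', 'k']
print(re.compile('[A-Za-z]+').findall(sample))  # ['The', 'Rock']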
Upvotes: 2