bkshi

Reputation: 300

Extract words from text file

I am working with recursive neural networks and need to extract the words from my input text file, which contains trees. The input file looks like this:

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

(4 (4 (4 (2 The) (4 (3 gorgeously) (3 (2 elaborate) (2 continuation)))) (2 (2 (2 of) (2 ``)) (2 (2 The) (2 (2 (2 Lord) (2 (2 of) (2 (2 the) (2 Rings)))) (2 (2 '') (2 trilogy)))))) (2 (3 (2 (2 is) (2 (2 so) (2 huge))) (2 (2 that) (3 (2 (2 (2 a) (2 column)) (2 (2 of) (2 words))) (2 (2 (2 (2 can) (1 not)) (3 adequately)) (2 (2 describe) (2 (3 (2 (2 co-writer/director) (2 (2 Peter) (3 (2 Jackson) (2 's)))) (3 (2 expanded) (2 vision))) (2 (2 of) (2 (2 (2 J.R.R.) (2 (2 Tolkien) (2 's))) (2 Middle-earth))))))))) (2 .)))

As output, I want the list of words in a new text file, like this:

The
Rock
is
destined
...

I tried doing it in Python but could not arrive at a solution. I also read that awk can be used for text processing, but I was unable to produce any working code. Any help is appreciated.

Upvotes: 2

Views: 12558

Answers (4)

Engineero

Reputation: 12938

You can use regex!

import re
my_string = "..."  # replace with the tree text from the question
pattern = r"\(\d\s+('?\w+)"
results = re.findall(pattern, my_string)
print(results)
# ['The',
#  'Rock',
#  'is',
#  'destined',
#  'to',
#  'be',
#  'the',
# ...

Note that re.findall will return a list of matches, so if you want to print them all out in a single sentence, you can use:

' '.join(results)

or whatever other character you want to separate words with instead of a blank space.
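Since the question asks for one word per line in a new file, the same idea can be sketched with a newline as the separator. The shortened sample string and the output location (`word_list.txt` in the system temp directory) are assumptions made for illustration, not part of the original post.

```python
import re
import tempfile
from pathlib import Path

# Shortened, made-up input in the same format as the question's trees.
sample = "(3 (2 (2 The) (2 Rock)) (2 is))"
results = re.findall(r"\(\d\s+('?\w+)", sample)

# Hypothetical output path; one word per line, with a trailing newline.
out_path = Path(tempfile.gettempdir()) / "word_list.txt"
out_path.write_text("\n".join(results) + "\n", encoding="utf-8")

# Read the file back to confirm what was written.
written = out_path.read_text(encoding="utf-8").splitlines()
print(written)
```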

Breaking the regular expression pattern down we have:

pattern = r"""
           \(           # match opening parenthesis
             \d         # match a number. If the numbers can be >9, use \d+
               \s+      # match one or more white space characters
                  (     # begin capturing group (only return stuff inside these parentheses)
                   '?   # match zero or one apostrophes (so we don't miss possessives)
                   \w+  # match one or more word characters (letters, digits, underscore)
                  )     # end capture group
           """

Upvotes: 4

hilberts_drinking_problem

Reputation: 11602

For the record, we can choose what to throw away rather than what to keep. For example, we can split on parens, spaces and digits. The remainder consists of words and punctuation. This might be handy for non-Latin text and special characters.

import re

# split on parens, digits and whitespace
spl = re.compile(r"\(|\s|[0-9]|\)")
# string_to_split holds the tree text; filter(None, ...) drops the empty strings
words = list(filter(None, spl.split(string_to_split)))
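One caveat: splitting on every digit also cuts digits out of the tokens themselves, so `21st` comes back as `st`. A possible workaround, sketched here as an assumption rather than as part of the answer, is to split only on parentheses and whitespace and then drop the purely numeric pieces, which are the tree labels. The trade-off is that a leaf consisting only of digits would be dropped as well.

```python
import re

# Made-up sample in the question's format.
sample = "(2 (2 (2 21st) (2 Century)) (2 (2 ``) (2 Conan)))"

# Splitting on every digit also removes the digits inside tokens.
by_digits = [t for t in re.split(r"[()\s0-9]", sample) if t]

# Assumed alternative: split on parentheses and whitespace only,
# then drop the pieces that are purely numeric (the tree labels).
parts = [t for t in re.split(r"[()\s]+", sample) if t]
words = [t for t in parts if not t.isdigit()]

print(by_digits)
print(words)
```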

Upvotes: 3

Ajax1234

Reputation: 71461

You can use re.findall:

import re
with open('tree_file.txt') as f, open('word_list.txt', 'w') as f1:
    # set() removes duplicates but does not preserve the order of the words
    f1.write('\n'.join(set(re.findall(r"[a-zA-Z\-.'/]+", f.read()))))

Running the code above on the sample text produces the following (the order varies between runs because a set is unordered):

make
not
gorgeously
the
Conan
than
so
huge
and
co-writer/director
Peter
st
is
can
Schwarzenegger
expanded
even
trilogy
Middle-earth
Segal
continuation
column
vision
's
he
''
Damme
adequately
that
greater
Steven
Rock
Jackson
Rings
a
Tolkien
Van
be
words
going
to
new
Jean-Claud
or
elaborate
of
splash
Lord
The
Arnold
describe
destined
J.R.R.
Century
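If the words should stay in the order they appear in the trees, a set is not the right tool. A minimal sketch, assuming you still want duplicates removed, is to use dict.fromkeys, which keeps the first occurrence of each word in insertion order. The sample string is made up for illustration.

```python
import re

# Made-up sample in the question's format, with repeated words.
sample = "(2 (2 (2 The) (2 Rock)) (2 (2 is) (2 (2 The) (2 Rock))))"
found = re.findall(r"[a-zA-Z\-.'/]+", sample)

# dict keys keep insertion order (Python 3.7+), so this removes
# duplicates while preserving the order of first appearance.
unique_in_order = list(dict.fromkeys(found))
print(unique_in_order)
```

To keep every occurrence, as the question's expected output suggests, write `found` directly instead of deduplicating.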

Upvotes: 3

A. Colonna

Reputation: 872

You can use re.compile:

import re
def getWords(text):
    # '+' matches whole runs of letters; without it every letter is returned separately
    return re.compile('[A-Za-z]+').findall(text)

with open('input_file.txt') as f_in:
    with open('output_file.txt', 'a') as f_out:
        f_out.write('\n'.join(getWords(f_in.read())))

Upvotes: 2
