Shaheen Gul

Reputation: 93

How to extract the contextual words of a token in Python

I want to extract the contextual (surrounding) words of a specific word. I could use n-grams in Python for this, but the drawback is that the window slides over every token one position at a time, while I only need the context of one specific word. For example, my file looks like this:

 IL-2  
 gene  
 expression  
 and  
 NF-kappa  
 B  
 activation  
 through  
 CD28  
 requires  
 reactive  
 oxygen  
 production  
 by  
 5-lipoxygenase  
 .  

That is, there is one token per line. I want to extract the surrounding words of a given token; for example, "through" and "requires" are the surrounding words of "CD28". I wrote some Python code, but it did not work and raised the error ValueError: list.index(x): x not in list.
My code is:

import re;
import nltk;
file=open("C:/Python26/test.txt");
contents= file.read()
tokens = nltk.word_tokenize(contents)
f=open("trigram.txt",'w');
for l in tokens:
    print tokens[l],tokens[l+1]
f.close();

Upvotes: 0

Views: 1769

Answers (4)

Shaheen Gul

Reputation: 93

This code also gives the same result

import nltk;
from nltk.util import ngrams
from nltk import word_tokenize
file = open("C:/Python26/tokens.txt");
contents=file.read();
tokens = nltk.word_tokenize(contents);
f_tri = open("trigram.txt",'w');               
trigram = ngrams(tokens,3)
for t in trigram:
    f_tri.write(str(t)+"\n")
f_tri.close()
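
If only the trigrams centred on one word are wanted, the sliding-window output can be filtered instead of being written out in full. A minimal sketch, assuming the tokens are already in a list; `centered_trigrams` is a hypothetical helper name, and the trigrams are built with plain `zip` here so the snippet runs without NLTK:

```python
def centered_trigrams(tokens, target):
    """Return every (left, target, right) triple whose middle token is target."""
    # Zipping three offset views of the list yields consecutive triples.
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    return [t for t in trigrams if t[1] == target]


tokens = ["activation", "through", "CD28", "requires", "reactive"]
print(centered_trigrams(tokens, "CD28"))
# prints [('through', 'CD28', 'requires')]
```

A token at the very start or end of the list never sits in the middle of a trigram, so it produces no match with this approach.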

Upvotes: 0

Shaheen Gul

Reputation: 93

file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');

with open(file,'r') as rf:
    lines = rf.readlines();
for word in range(1,len(lines)-1):
    f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
    f.write("\n")
f.close()

Upvotes: 0

Rameshkumar R

Reputation: 140

First of all, list.index(x) returns the index of the first item in the list whose value is x:

>>> ["foo", "bar", "baz"].index('bar')
1

In your code, the variable 'word' takes its values from a range of integers, not from the actual contents of the list, so we can't pass 'word' directly to list.index():

>>> print lines.index(1)
ValueError: 1 is not in list

Change your code like this:

file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');

with open(file,'r') as rf:
    lines = rf.readlines();

for word in range(1,len(lines)-1):
    f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip()+"\n")

f.close()
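
To get the neighbours of one specific token rather than of every line, the positions can be found with enumerate(), which avoids list.index() (it only reports the first occurrence and raises ValueError when the value is absent). The following is only a sketch: `context_words` and its `window` parameter are hypothetical names, and the token list is typed in by hand instead of being read from the file.

```python
def context_words(tokens, target, window=1):
    """Return (left, right) neighbour lists for every occurrence of target."""
    contexts = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, i - window):i]   # clipped at the start of the list
        right = tokens[i + 1:i + 1 + window]  # slicing clips at the end by itself
        contexts.append((left, right))
    return contexts


tokens = ["IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation",
          "through", "CD28", "requires", "reactive", "oxygen", "production",
          "by", "5-lipoxygenase", "."]
print(context_words(tokens, "CD28"))            # [(['through'], ['requires'])]
print(context_words(tokens, "IL-2", window=2))  # [([], ['gene', 'expression'])]
```

A token that does not occur simply gives an empty list, and a token at the edge of the file gets a shorter context instead of an IndexError.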

Upvotes: 1

Laraconda

Reputation: 707

I don't fully understand what you want to do, but I'll do my best.

If you want to process words with Python, there is a library called NLTK, which stands for Natural Language Toolkit.

You may need to tokenize a sentence or a document:

import nltk


def tokenize_query(query):
    return nltk.word_tokenize(query)

f = open('C:/Python26/tokens.txt')
raw = f.read()
tokenize_query(raw)

We can also read a file one line at a time using a for loop:

f = open('C:/Python26/tokens.txt', 'rU')
for line in f:
    print(line.strip())

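'r' means 'read' and 'U' means 'universal newlines', if you are wondering. Note that the 'U' flag was deprecated in Python 3 and removed in Python 3.11, where plain 'r' already handles newlines this way.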

strip() removes leading and trailing whitespace from each line, including the trailing '\n'.
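
In the question's file each token is padded with spaces as well as ending in a newline, so calling strip() with no arguments takes care of both. A small sketch, using io.StringIO as a stand-in for the real file (an assumption, so that the snippet runs without the file being present):

```python
import io

# Stand-in for the token file: each line has padding spaces and a newline.
fake_file = io.StringIO(" IL-2  \n gene  \n expression  \n")

# Keep one clean token per non-empty line.
tokens = [line.strip() for line in fake_file if line.strip()]
print(tokens)  # ['IL-2', 'gene', 'expression']

# rstrip("\n") would remove only the newline and leave the spaces in place.
print(repr(" CD28  \n".rstrip("\n")))  # ' CD28  '
```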

The context may be provided by WordNet and its functions. I guess you should use synsets together with the word's POS (part of speech).

A synset is, roughly, a set of synonyms that share one meaning.

NLTK also provides other nice features, such as sentiment analysis and similarity between synsets.

Upvotes: 0
