Reputation: 103
I am trying to extract only nouns and noun phrases from address data (a column in a CSV file).
I was able to remove the stop words, punctuation, and numbers from the data, and I was also able to POS-tag it, but I am not able to extract the noun phrases and attach them back to the data frame. What went wrong?
import string
import nltk
from nltk import pos_tag_sents, word_tokenize

stopwords=nltk.corpus.stopwords.words('english')
user_defined_stop_words=['hong','kong','hk','kowloon','hongkong']
new_stop_words=stopwords+user_defined_stop_words
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if not item.isdigit()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if item not in string.punctuation]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item not in (new_stop_words)]))
texts = data['Clean_addr'].tolist()
tagged_texts = pos_tag_sents(map(word_tokenize, texts))
data['POS']=tagged_texts
data['POS']=data['POS'].apply(lambda x:' '.join([item[0] for item in x if (item[0][1]=='NNP' or item[0][1]=='NNS')]))
Sample dump of the file I am using:
https://www.dropbox.com/s/allhfdxni0kfyn6/Test.csv?dl=0
Upvotes: 0
Views: 340
Reputation: 30605
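Based on the linked data: each element of data['POS'] is a list of (word, tag) tuples, so the tag is i[1]. In your lambda, item[0][1] is the second character of the word, not its tag, so the filter never matches. Filtering on i[1] works: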
data['POS'].apply(lambda x : ','.join([i[0] for i in x if (i[1]=='NNS' or i[1] =='NNP')]))
0 des
1 des
2 cfa,des
3 registrations
4
5 floors
6 queens
7 queens
8 queens
9
10 solicitors
11
12
13
14
15 des
Name: POS, dtype: object
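To make the indexing difference concrete, here is a minimal sketch that runs without NLTK or the CSV. The rows in `tagged_rows` are made up (they only imitate the shape that `pos_tag_sents` returns), and the helper names `keep_nouns` and `buggy_keep_nouns` are hypothetical, not part of any library.

```python
# Hypothetical tagged rows shaped like the output of nltk.pos_tag_sents:
# one list of (word, tag) tuples per address. The tags are invented for
# illustration and are not real tagger output.
tagged_rows = [
    [("des", "NNS"), ("voeux", "VBP"), ("road", "NN")],
    [("queens", "NNS"), ("road", "NN"), ("central", "JJ")],
    [],
]

NOUN_TAGS = {"NNP", "NNS"}  # the two tags filtered on in the question


def keep_nouns(tagged, tags=NOUN_TAGS):
    """Join the words whose POS tag is in `tags`."""
    return ",".join(word for word, tag in tagged if tag in tags)


def buggy_keep_nouns(tagged):
    """Same filter as the question: compares a character, not the tag."""
    # item[0] is the word, so item[0][1] is the word's second character.
    return " ".join(item[0] for item in tagged if item[0][1] in NOUN_TAGS)


fixed = [keep_nouns(row) for row in tagged_rows]
buggy = [buggy_keep_nouns(row) for row in tagged_rows]
print(fixed)  # ['des', 'queens', '']
print(buggy)  # ['', '', '']
```

On the real data frame the same helper could be applied with data['POS'].apply(keep_nouns). If the goal is every noun tag in the Penn Treebank set (NN, NNS, NNP, NNPS), one option is to test tag.startswith('NN') instead of checking two fixed tags.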
Upvotes: 1