Reputation: 1
I'm working on an NLP project where I have been given a POS-tagged dataset of sentences. The format of each token in the dataset (example sentences are provided below) is
('word', 'pos_tag')
unless the word contains a single quote (clitics such as 're, 's and n't, and also '' for closing quotes), in which case the format is
("word", "pos_tag")
The code I am using to process this dataset is as follows:
import re

def corpus_reader(filepath):
    # cond1|cond2: cond1 captures plain words, cond2 is the part that fails
    pattern = '\(\'(\w+)\', |(?<=\").*?\", '
    sentences = []
    with open(filepath) as f:
        corpus = f.readlines()
    for line in corpus:
        temp = re.findall(pattern, line)
        sentences.append(temp)
    return sentences
The pattern consists of two alternatives, cond1|cond2.
cond1 matches and extracts all the ordinary words in the corpus.
cond2 is meant to match '', n't, 's and 're, which are enclosed in double quotes as mentioned above, but it does not do that.
The desired result is a list of all the POS-tagged tokens.
Could someone please provide the correct regex pattern to detect the cases I have mentioned?
Here are example sentences to be parsed that contain 're, n't, 's and '':
[('We', 'PRP'), ("'re", 'VBP'), ('talking', 'VBG'), ('about', 'IN'), ('years', 'NNS'), ('ago', 'IN'), ('before', 'IN'), ('anyone', 'NN'), ('heard', 'VBD'), ('of', 'IN'), ('asbestos', 'NN'), ('having', 'VBG'), ('any', 'DT'), ('questionable', 'JJ'), ('properties', 'NNS'), ('.', '.')]
[('``', '``'), ('We', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('useful', 'JJ'), ('information', 'NN'), ('on', 'IN'), ('whether', 'IN'), ('users', 'NNS'), ('are', 'VBP'), ('at', 'IN'), ('risk', 'NN'), (',', ','), ("''", "''"), ('said', 'VBD'), ('T-1', '-NONE-'), ('James', 'NNP'), ('A.', 'NNP'), ('Talcott', 'NNP'), ('of', 'IN'), ('Boston', 'NNP'), ("'s", 'POS'), ('Dana-Farber', 'NNP'), ('Cancer', 'NNP'), ('Institute', 'NNP'), ('.', '.')]
[('The', 'DT'), ('U.S.', 'NNP'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('few', 'JJ'), ('industrialized', 'VBN'), ('nations', 'NNS'), ('that', 'WDT'), ('T-7', '-NONE-'), ('does', 'VBZ'), ("n't", 'RB'), ('have', 'VB'), ('a', 'DT'), ('higher', 'JJR'), ('standard', 'NN'), ('of', 'IN'), ('regulation', 'NN'), ('for', 'IN'), ('the', 'DT'), ('smooth', 'JJ'), (',', ','), ('needle-like', 'JJ'), ('fibers', 'NNS'), ('such', 'JJ'), ('as', 'IN'), ('crocidolite', 'NN'), ('that', 'WDT'), ('T-1', '-NONE-'), ('are', 'VBP'), ('classified', 'VBN'), ('*-5', '-NONE-'), ('as', 'IN'), ('amphobiles', 'NNS'), (',', ','), ('according', 'VBG'), ('to', 'TO'), ('Brooke', 'NNP'), ('T.', 'NNP'), ('Mossman', 'NNP'), (',', ','), ('a', 'DT'), ('professor', 'NN'), ('of', 'IN'), ('pathlogy', 'NN'), ('at', 'IN'), ('the', 'DT'), ('University', 'NNP'), ('of', 'IN'), ('Vermont', 'NNP'), ('College', 'NNP'), ('of', 'IN'), ('Medicine', 'NNP'), ('.', '.')]
[('``', '``'), ('What', 'WP'), ('T-14', '-NONE-'), ('matters', 'VBZ'), ('is', 'VBZ'), ('what', 'WP'), ('advertisers', 'NNS'), ('are', 'VBP'), ('paying', 'VBG'), ('T-15', '-NONE-'), ('per', 'IN'), ('page', 'NN'), (',', ','), ('and', 'CC'), ('in', 'IN'), ('that', 'DT'), ('department', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('doing', 'VBG'), ('fine', 'RB'), ('this', 'DT'), ('fall', 'NN'), (',', ','), ("''", "''"), ('said', 'VBD'), ('T-1', '-NONE-'), ('Mr.', 'NNP'), ('Spoon', 'NNP'), ('.', '.')]
Thanks in advance for any help.
Upvotes: 0
Views: 170
Reputation: 44108
I would use the following pattern (written in free-spacing mode, e.g. with re.VERBOSE in Python):
(               # start of capture group 1
  (?<=\(')      # first alternative: positive lookbehind: ('
  [^']*         # zero or more characters other than '
  (?=',)        # positive lookahead: ',
|               # start of second alternative:
  (?<=\(")      # positive lookbehind: ("
  [^"]*         # zero or more characters other than "
  (?=",)        # positive lookahead: ",
)
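As a rough sketch of how the pattern above could be applied in Python: the regex is the one shown, compiled with re.VERBOSE so that the inline comments and whitespace are ignored. The names WORD_PATTERN, extract_words and sample are made up for this illustration, and the sample line is a shortened mix of tokens from the question, not a real corpus line. Note that this extracts only the words, not the tags.

```python
import re

# The pattern above, compiled in verbose mode (comments/whitespace are ignored).
# WORD_PATTERN is a hypothetical name used only in this sketch.
WORD_PATTERN = re.compile(r"""
    (                 # start of capture group 1
      (?<=\(')        # first alternative: positive lookbehind: ('
      [^']*           # zero or more characters other than '
      (?=',)          # positive lookahead: ',
    |                 # start of second alternative:
      (?<=\(")        # positive lookbehind: ("
      [^"]*           # zero or more characters other than "
      (?=",)          # positive lookahead: ",
    )
""", re.VERBOSE)


def extract_words(line):
    """Return the first element (the word) of every tuple found in one line."""
    return WORD_PATTERN.findall(line)


# Shortened sample assembled from tokens in the question (an assumption,
# not an actual line of the dataset).
sample = """[('We', 'PRP'), ("'re", 'VBP'), ('does', 'VBZ'), ("n't", 'RB'), (',', ','), ("''", "''"), ('.', '.')]"""

words = extract_words(sample)
print(words)  # ['We', "'re", 'does', "n't", ',', "''", '.']
```

Because both alternatives rely on lookarounds, the surrounding quotes and the trailing comma are checked but never become part of the match, so single-quoted and double-quoted words come back in the same form.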
Upvotes: 0