Sumanth Balaji

Reputation: 1

Regex pattern to find all matches for suffixes, end quotes and words in English POS tagged corpus

I'm working on an NLP project where I have been given a POS-tagged dataset of sentences. Each token in the dataset (example sentences are provided below) has the format

('word', 'pos_tag')

unless the word contains a single quote (affixes such as 're, 's and n't, and also '' for end quotes), in which case the format is

("word", "pos_tag")

The code segment I am using to process this dataset is as follows:

import re

def corpus_reader(filepath):
    # Two alternatives: cond1 | cond2 (explained below)
    pattern = r'\(\'(\w+)\', |(?<=\").*?\", '
    sentences = []
    with open(filepath) as f:
        corpus = f.readlines()

    # Collect the regex matches for each line (one sentence per line)
    for line in corpus:
        temp = re.findall(pattern, line)
        sentences.append(temp)

    return sentences

The pattern consists of two alternatives, cond1|cond2.

cond1 matches and extracts all the words in the corpus.

cond2 is meant to match '', n't, 's and 're, which are enclosed in double quotes as I mentioned above, but the second alternative does not do that.
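For reference, here is a minimal reproduction of the behavior on a shortened sample line (the line is a hypothetical example modeled on the corpus format, not taken verbatim from the dataset). It suggests one likely cause: when a pattern has exactly one capture group, `re.findall` returns that group's text, so a match through cond2, which has no group, comes back as an empty string.

```python
import re

# The pattern from the question, written as a raw string (same regex).
pattern = r'\(\'(\w+)\', |(?<=\").*?\", '

# Shortened sample line (hypothetical, modeled on the corpus format).
line = """[('We', 'PRP'), ("'re", 'VBP'), ('talking', 'VBG')]"""

# findall returns the content of group 1 for every match; the cond2 match
# has no group 1, so it shows up as '' instead of 're.
result = re.findall(pattern, line)
print(result)  # ['We', '', 'talking']

# finditer exposes the whole match, which shows cond2 does match,
# but it also consumes the closing quote, comma and space.
full_matches = [m.group(0) for m in re.finditer(pattern, line)]
print(full_matches)
```

Note that cond1 also uses `\w+`, so tokens such as 'U.S.', 'T-1' or ',' would not be matched by it either.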

The desired result is a list of all the POS-tagged tokens.

Could someone please provide the correct regex pattern to use to detect the cases I have mentioned?

Here are example sentences to be parsed that contain 're, n't, 's and '':

[('We', 'PRP'), ("'re", 'VBP'), ('talking', 'VBG'), ('about', 'IN'), ('years', 'NNS'), ('ago', 'IN'), ('before', 'IN'), ('anyone', 'NN'), ('heard', 'VBD'), ('of', 'IN'), ('asbestos', 'NN'), ('having', 'VBG'), ('any', 'DT'), ('questionable', 'JJ'), ('properties', 'NNS'), ('.', '.')]

[('', ''), ('We', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('useful', 'JJ'), ('information', 'NN'), ('on', 'IN'), ('whether', 'IN'), ('users', 'NNS'), ('are', 'VBP'), ('at', 'IN'), ('risk', 'NN'), (',', ','), ("''", "''"), ('said', 'VBD'), ('T-1', '-NONE-'), ('James', 'NNP'), ('A.', 'NNP'), ('Talcott', 'NNP'), ('of', 'IN'), ('Boston', 'NNP'), ("'s", 'POS'), ('Dana-Farber', 'NNP'), ('Cancer', 'NNP'), ('Institute', 'NNP'), ('.', '.')]

[('The', 'DT'), ('U.S.', 'NNP'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('few', 'JJ'), ('industrialized', 'VBN'), ('nations', 'NNS'), ('that', 'WDT'), ('T-7', '-NONE-'), ('does', 'VBZ'), ("n't", 'RB'), ('have', 'VB'), ('a', 'DT'), ('higher', 'JJR'), ('standard', 'NN'), ('of', 'IN'), ('regulation', 'NN'), ('for', 'IN'), ('the', 'DT'), ('smooth', 'JJ'), (',', ','), ('needle-like', 'JJ'), ('fibers', 'NNS'), ('such', 'JJ'), ('as', 'IN'), ('crocidolite', 'NN'), ('that', 'WDT'), ('T-1', '-NONE-'), ('are', 'VBP'), ('classified', 'VBN'), ('*-5', '-NONE-'), ('as', 'IN'), ('amphobiles', 'NNS'), (',', ','), ('according', 'VBG'), ('to', 'TO'), ('Brooke', 'NNP'), ('T.', 'NNP'), ('Mossman', 'NNP'), (',', ','), ('a', 'DT'), ('professor', 'NN'), ('of', 'IN'), ('pathlogy', 'NN'), ('at', 'IN'), ('the', 'DT'), ('University', 'NNP'), ('of', 'IN'), ('Vermont', 'NNP'), ('College', 'NNP'), ('of', 'IN'), ('Medicine', 'NNP'), ('.', '.')]

[('', ''), ('What', 'WP'), ('T-14', '-NONE-'), ('matters', 'VBZ'), ('is', 'VBZ'), ('what', 'WP'), ('advertisers', 'NNS'), ('are', 'VBP'), ('paying', 'VBG'), ('T-15', '-NONE-'), ('per', 'IN'), ('page', 'NN'), (',', ','), ('and', 'CC'), ('in', 'IN'), ('that', 'DT'), ('department', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('doing', 'VBG'), ('fine', 'RB'), ('this', 'DT'), ('fall', 'NN'), (',', ','), ("''", "''"), ('said', 'VBD'), ('T-1', '-NONE-'), ('Mr.', 'NNP'), ('Spoon', 'NNP'), ('.', '.')]

Thanks in advance to anyone who attempts to answer or help.

Upvotes: 0

Views: 170

Answers (1)

Booboo

Reputation: 44108

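I would use the following pattern (written in verbose mode with inline comments, so in Python it needs the re.VERBOSE flag):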

(               # start of capture group 1
  (?<=\(')      # first alternative: positive lookbehind: ('
  [^']*         # zero or more characters other than '
  (?=',)        # positive lookahead: ',
|               # start of second alternative:
  (?<=\(")      # positive lookbehind: ("
  [^"]*         # zero or more characters other than "
  (?=",)        # positive lookahead: ",
)
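As a sketch of how the pattern above might be used from Python: it is compiled with `re.VERBOSE` so the whitespace and `#` comments are ignored. The name `TOKEN_PATTERN` and the sample line are illustrative assumptions, not part of the original code.

```python
import re

# The pattern above, compiled in verbose mode. TOKEN_PATTERN is a
# hypothetical name chosen for this example.
TOKEN_PATTERN = re.compile(r"""
(               # start of capture group 1
  (?<=\(')      # first alternative: preceded by ('
  [^']*         # zero or more characters other than '
  (?=',)        # followed by ',
|               # second alternative:
  (?<=\(")      # preceded by ("
  [^"]*         # zero or more characters other than "
  (?=",)        # followed by ",
)
""", re.VERBOSE)

# Shortened sample line (hypothetical, modeled on the corpus format).
line = """[('We', 'PRP'), ("'re", 'VBP'), ('talking', 'VBG'), ("''", "''"), ('.', '.')]"""

# Each match is the word part of one (word, tag) tuple.
words = TOKEN_PATTERN.findall(line)
print(words)
```

This extracts only the word of each tuple, not the tag. Because the lookarounds are zero-width, the surrounding quotes and comma are not part of the match.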

See Regex Demo
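If the goal is the full list of (word, tag) pairs rather than just the words, a different technique from the regex may be simpler: the sample lines look like valid Python list literals, so `ast.literal_eval` can parse them directly. This is only a sketch under that assumption; `parse_tagged_line` is a hypothetical helper name, and a line that is not a valid literal would raise an exception.

```python
import ast

def parse_tagged_line(line):
    """Parse one corpus line into a list of (word, tag) tuples.

    Assumes the line is a valid Python literal such as
    [('We', 'PRP'), ("'re", 'VBP')].
    """
    return ast.literal_eval(line.strip())

# Shortened sample line (hypothetical, modeled on the corpus format).
sample = """[('We', 'PRP'), ("'re", 'VBP'), ("''", "''")]"""
pairs = parse_tagged_line(sample)
print(pairs)
```

This handles single-quoted and double-quoted tokens the same way, since the Python parser deals with the quoting.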

Upvotes: 0
