Boudribila

Reputation: 71

Creating a custom dataset based on CoNLL2003

I'm working on a named entity recognition (NER) project and would like to create my own dataset based on the CoNLL2003 dataset (link: https://huggingface.co/datasets/conll2003). I've been looking at the CoNLL2003 data and I'm having trouble understanding how the chunk column is labeled. I'm not sure if it's based on the part-of-speech (POS) tags or on something else. Ideally, I'd like to automate the process of creating the chunk labels for my custom dataset, rather than doing it manually. Can someone explain how the chunk column is labeled in CoNLL2003 and provide some guidance on how I can programmatically generate the same labels for my own dataset?

To explain further, let's take the first row of the dataset, work through it, and check that I get the same results.

The first sentence of the dataset is: EU rejects German call to boycott British lamb.

# Import the libraries needed for tokenization and POS tagging
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# The first sentence from the conll2003 dataset
Sentence = "EU rejects German call to boycott British lamb."

The tokens of the same sentence are: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

nltk.download('punkt')  # tokenizer model, only needed once

# Tokenize the sentence
tokens = word_tokenize(Sentence)
# Print the tokens
print(tokens)

The part-of-speech tags of the same sentence are: ['NNP', 'VBZ', 'JJ', 'NN', 'TO', 'VB', 'JJ', 'NN', '.'], which maps to [22, 42, 16, 21, 35, 37, 16, 21, 7].

pos_tags = {'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12, 'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23, 'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33, 'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43, 'WP': 44, 'WP$': 45, 'WRB': 46}
nltk.download('averaged_perceptron_tagger')  # POS tagger model, only needed once

# POS-tag the tokens
pos_tagged = pos_tag(tokens)
# Print the POS-tagged tokens
print(pos_tagged)
# Map each POS tag to its integer id
pos_tags_only = [pos_tags[tag] for word, tag in pos_tagged]
# Print the POS ids
print(pos_tags_only)
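Since the dataset stores these tags as integers, it can also help to invert the mapping so the ids can be decoded back into tag strings. A minimal sketch (using an abbreviated copy of the `pos_tags` dictionary above, so the snippet is self-contained):

```python
# Abbreviated tag-to-id mapping from the conll2003 feature definition;
# inverting it lets us decode integer features back into tag strings.
pos_tags = {'NNP': 22, 'VBZ': 42, 'JJ': 16, 'NN': 21, 'TO': 35, 'VB': 37, '.': 7}
id_to_tag = {i: tag for tag, i in pos_tags.items()}

ids = [22, 42, 16, 21, 35, 37, 16, 21, 7]
decoded = [id_to_tag[i] for i in ids]
print(decoded)  # ['NNP', 'VBZ', 'JJ', 'NN', 'TO', 'VB', 'JJ', 'NN', '.']
```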

The chunk tags of the same sentence should be: ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'B-NP', 'I-NP', 'O'], which maps to [11, 21, 11, 12, 21, 22, 11, 12, 0]. But I have no idea how these were produced. I have already tried some code, but I don't get exactly the same result. If anyone has an idea of how they generated these chunk tags, please guide me.

# To start with, this is the chunk tag set used in conll2003 dataset
chunk_tags = {'O': 0, 'B-ADJP': 1, 'I-ADJP': 2, 'B-ADVP': 3, 'I-ADVP': 4, 'B-CONJP': 5, 'I-CONJP': 6, 'B-INTJ': 7, 'I-INTJ': 8, 'B-LST': 9, 'I-LST': 10, 'B-NP': 11, 'I-NP': 12, 'B-PP': 13, 'I-PP': 14, 'B-PRT': 15, 'I-PRT': 16, 'B-SBAR': 17, 'I-SBAR': 18, 'B-UCP': 19, 'I-UCP': 20, 'B-VP': 21, 'I-VP': 22}  

Upvotes: 4

Views: 887

Answers (1)

anon

Chunking is a process in NLP that works in conjunction with part-of-speech (POS) tagging to form clusters, or chunks, of words that should be considered together during text processing. This is helpful in named entity recognition (NER). For example, the tokens John and Smith are each tagged NNP (proper noun), but John Smith should be treated as a single chunk for processing purposes. Similarly, in 123 Some Street, each token gets a different POS tag, yet together they form an address and should be chunked as one unit.

This is an overly simplistic explanation, but I hope that it is a helpful "foothold" to the topic.

Different modules chunk differently, so you will get different results depending on which chunking function or chunking module you use.

This article by Nikita Bachani on Medium is an excellent introduction to the topic.

Upvotes: 2
