Reputation: 99
How can I convert a data format like:
James Smith was born on November 17, 1948
into something like
("James Smith", DOB, "November 17, 1948")
without having to rely on the positional index of strings?
I have tried the following:
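import nltk
from nltk import word_tokenize, pos_tag
new = "James Smith was born on November 17, 1948"
# Tokenize the sentence, then tag each token with its part of speech
sentences = word_tokenize(new)
sentences = pos_tag(sentences)
# Chunk two adjacent tokens whose tags match NNP*, e.g. "James Smith"
grammar = "Chunk: {<NNP*><NNP*>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentences)
print(result)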
How do I proceed further to get the output in the desired format?
Upvotes: 2
Views: 112
Reputation: 172
You could always use a regular expression.
The regex (\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+)
will match and return the data, but only for the specific string format above.
Here it is in action: https://regex101.com/r/W2ykKS/1
The regex in Python:
import re
regex = r"(\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+)"
test_str = "James Smith was born on November 17, 1948"
matches = re.search(regex, test_str)
# group 0 is the entire match; groups 1-5 are the captured parts
print(matches.group(1)) # James
print(matches.group(2)) # Smith
print(matches.group(3)) # November
print(matches.group(4)) # 17
print(matches.group(5)) # 1948
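To get closer to the tuple asked for in the question, the same idea can be sketched with named groups. This is only a minimal sketch: the pattern, the `extract_dob` function name, and the literal "DOB" label are assumptions for illustration, and it still only handles sentences of the form "<name> was born on <date>".

```python
import re

# Assumed pattern: the name is everything before "was born on",
# the date is everything after it (not limited to two-word names).
PATTERN = re.compile(r"^(?P<name>.+?)\s+was born on\s+(?P<dob>.+?)\s*$")

def extract_dob(sentence):
    """Return (name, "DOB", date), or None if the sentence does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("name"), "DOB", match.group("dob"))

print(extract_dob("James Smith was born on November 17, 1948"))
# ('James Smith', 'DOB', 'November 17, 1948')
```

Returning None for a non-matching sentence avoids the AttributeError that calling .group() on a failed re.search would raise.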
Upvotes: 1
Reputation: 1657
Split the string on 'was born on', then trim the surrounding spaces and assign the two parts to name and dob.
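A minimal sketch of that split, assuming the marker phrase appears exactly once in the sentence. The `split_name_dob` function name is made up for this example, and `str.partition` is used here instead of `str.split` so that a missing marker is easy to detect.

```python
def split_name_dob(sentence, marker="was born on"):
    """Split a sentence on the marker and return (name, dob), both trimmed."""
    name, sep, dob = sentence.partition(marker)
    if not sep:
        # partition returns an empty separator when the marker is absent
        raise ValueError("marker %r not found in sentence" % marker)
    return name.strip(), dob.strip()

name, dob = split_name_dob("James Smith was born on November 17, 1948")
print((name, "DOB", dob))
# ('James Smith', 'DOB', 'November 17, 1948')
```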
Upvotes: 1