aartist

Reputation: 3236

Extracting Words using Perl

I would like to extract the words from a text. I have written a simple regex:

my $regex = qr[\W];
while(<DATA>){
    push @words, split $regex;
}

I would like to modify it so that proper names are kept together. A proper name may combine multiple 'words', for example:

@names = ('John Smith', 'Joe Smith');

Upvotes: 1

Views: 360

Answers (2)

wespiserA

Reputation: 3168

I don't think there is a definitive solution. A regular expression is limited in a complex text space such as a web page or a book, which contains many anomalies (what about book titles, for example?). Look at using either 1) natural language processing or 2) an index approach, where you identify two words, each starting with a capital letter and separated by one space, and check whether one of them is contained in an index of known first or last names. Good luck.
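
A minimal sketch of the index approach might look like the following. The `%known_name` hash and the `find_names` helper are hypothetical names invented for this illustration, and the four-entry index stands in for a real list of first and last names that you would have to supply yourself.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical index of known first and last names (an assumption for this
# sketch; a real index would be loaded from a much larger list).
my %known_name = map { $_ => 1 } qw(John Joe Smith Doe);

# Return every pair of capitalized words, separated by exactly one space,
# in which at least one of the two words appears in the index.
sub find_names {
    my ($text) = @_;
    my @names;
    while ($text =~ /\b([A-Z][a-z]+) ([A-Z][a-z]+)\b/g) {
        my ($first, $second) = ($1, $2);
        if ($known_name{$first} || $known_name{$second}) {
            push @names, "$first $second";
        }
        else {
            # Rejected pair: resume scanning at its second word so that a
            # pair starting there is still considered.
            pos($text) = $-[2];
        }
    }
    return @names;
}

print "$_\n" for find_names('He met Joe Smith while reading Moby Dick with John Doe.');
```

With this tiny index, "Moby Dick" is skipped because neither word is a known name. The rule is still easy to fool: a capitalized sentence opener followed by a known name (for example "Hello Joe") would be accepted, so treat this as a starting point only.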

Upvotes: 2

JRFerguson

Reputation: 7516

Perhaps:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @words;
while(<DATA>){
    # Capture the first pair of adjacent capitalized words on each line.
    push @words, $1 if m{([A-Z]\w*\s+[A-Z]\w*)};
}
for my $name (@words) {
    print "$name\n";
}
print Dumper \@words;
__DATA__
John Smith I am
He is Joe Smith 
John Doe
Sam
Sally
Sally Girl

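If the goal is to keep the ordinary words as well as the names, one possible variation is to put the two-word pattern first in an alternation and fall back to a single word. This is only a sketch: the `tokenize` helper is a hypothetical name, and it assumes that any two adjacent capitalized words form a name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split text into tokens, keeping a pair of adjacent capitalized words
# together as one token and returning every other word on its own.
# Intended to be called in list context.
sub tokenize {
    my ($text) = @_;
    # Alternation order matters: the two-word alternative is tried first.
    return $text =~ /\b([A-Z]\w*\s+[A-Z]\w*|\w+)/g;
}

my @words = map { tokenize($_) } ('John Smith I am', 'He is Joe Smith', 'Sam');
print "$_\n" for @words;
```

Because the pattern relies on capitalization alone, input such as "I John Smith" would be split as "I John" and "Smith", so it may need to be combined with a name index for real text.
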
Upvotes: 1
