Reputation: 135
I'm building a Drupal education site for foreign-language learners that has a vocabulary-sharing function and a flashcard feature. I'm thinking of adding a way to parse texts (newspaper articles and such) and output a list of the words used, then perhaps cross-link that list to the vocabulary section.
For now, I'm wondering whether there are any programs or scripts, ideally in PHP or possibly Python, that could parse a text into a list of the words used (and, if possible, exclude a list of the most common words). I'd like to adapt it to work within Drupal, so PHP would be best, but I'm open to any of the existing tools. Any ideas?
I'm not really sure where to even start on this one.
Upvotes: 1
Views: 565
Reputation: 36627
Simplistic start:
<?php
// source text
$paragraph = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Proin congue, quam nec tincidunt congue, massa ipsum sodales tellus,
in rhoncus sem quam quis ante. Nam condimentum pellentesque libero at
blandit. Suspendisse felis sem, interdum pulvinar ultricies a, auctor
vel leo. Curabitur congue mi nec purus placerat sit amet mollis magna
laoreet. Duis eu purus non turpis lacinia sagittis. Aliquam tristique
nulla volutpat neque posuere faucibus. Aenean tempus diam quis sem
convallis id cursus lorem sagittis. Nam feugiat, felis nec tincidunt
aliquet, felis lectus bibendum mi, ut tincidunt purus urna ac felis.
Quisque ut lectus dolor. Duis ipsum arcu, adipiscing id vestibulum
fringilla, euismod non augue. Nullam quis ipsum nec tortor tristique
egestas sed nec leo. Pellentesque tempus velit lacus, sit amet rhoncus
mi. Curabitur justo ipsum, consectetur ac vestibulum sed, porttitor
eget dui. Vivamus nisi lorem, porta vel gravida quis, varius et elit.
Nulla eros metus, congue sit amet interdum at, porta eget ligula.";
// replace newlines with spaces so words on adjacent lines are not glued together
$paragraph = str_replace(array("\r\n", "\r", "\n"), ' ', $paragraph);
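// convert to lowercase (mb_strtolower, from the mbstring extension, handles accented letters in UTF-8 text)
$paragraph = mb_strtolower($paragraph, 'UTF-8');
// remove everything except letters, digits and whitespace (Unicode-aware, so accented words survive)
$paragraph = preg_replace('/[^\p{L}\p{N}\s]/u', '', $paragraph);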
// convert into array
$words = explode(' ', $paragraph);
// remove null values
$words = array_filter($words, 'strlen');
// remove duplicate values
$words = array_unique($words);
// sort array alphabetically (optional)
natsort($words);
// reindex array
$words = array_values($words);
// display array
print_r($words);
?>
Update: Now removes newlines. Separated all modifications into individual commands.
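Since the question also mentions Python, here is a minimal sketch of the same steps (lowercase, strip punctuation, split, de-duplicate, sort) in that language. The function name `unique_words` and the sample sentence are made up for illustration, and the regular expression is one possible choice that keeps accented letters, which matters for foreign-language texts.

```python
import re

# Matches runs of letters (roughly, in any script); digits, underscores,
# punctuation and whitespace all act as separators.
WORD_RE = re.compile(r"[^\W\d_]+")

def unique_words(text):
    """Return the distinct words in text, lowercased and sorted."""
    return sorted(set(WORD_RE.findall(text.lower())))

# Hypothetical sample text spanning two lines.
sample = "Le café est fermé.\nLe thé, lui, est prêt!"
print(unique_words(sample))
```

Note that this splits on apostrophes and hyphens too, so a form like "l'eau" comes out as two entries; whether that is desirable depends on the language being taught.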
Upvotes: 2
Reputation: 4071
To exclude very common words you can use a list of stop words, for instance the SMART English list that the code below loads.
You could load them and subtract that set of stop words from your set of words (a set difference rather than an intersection):
<?php
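// read in stop words (a remote URL needs allow_url_fopen; file() returns false on failure)
$stopwords = file('ftp://ftp.cs.cornell.edu/pub/smart/english.stop', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($stopwords === false) { die('Could not load the stop word list.'); }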
// read in the words from your text
$words_from_text = array("notfrequent", "notfrequenttoo", "a", "is", "the", "superspecialword");
// remove the stop words
$final_words = array_diff($words_from_text, $stopwords);
// and have a look
var_dump($final_words);
?>
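The same idea can be sketched in Python. The tiny `STOP_WORDS` set and the helper `remove_stop_words` are hypothetical and only there for illustration; in practice the stop list would be read from a file, one word per line.

```python
# Hypothetical, deliberately tiny stop list; a real one would be loaded from a file.
STOP_WORDS = {"a", "is", "the"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep the words that are not in stop_words, preserving their order."""
    return [w for w in words if w.lower() not in stop_words]

words_from_text = ["notfrequent", "notfrequenttoo", "a", "is", "the", "superspecialword"]
print(remove_stop_words(words_from_text))
```

Keeping the stop words in a set makes each membership check cheap, which helps once the stop list grows to a few hundred entries.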
Upvotes: 1
Reputation: 831
You can use PHP's built-in file functions to read the file: http://www.w3schools.com/PHP/php_file.asp
Upvotes: 1
Reputation: 51029
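If the text of your article is a string,
import string
from collections import Counter
# Lowercase each token and strip surrounding punctuation
tokens = [w.strip(string.punctuation).lower() for w in text.split()]
# Get the set of words used in the text:
words = {w for w in tokens if w.isalpha()}
# Get word counts (whole tokens, not substring matches)
frequencies = Counter(w for w in tokens if w.isalpha())
You can drop the most common words from the set pretty easily with that. Stripping punctuation before calling isalpha() matters: otherwise a token such as "elit." fails the check and is discarded.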
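As a rough sketch of that last step, the frequency counts can be used to discard the n most frequent words. The helper name `drop_most_common` and the sample sentence are assumptions made for this example.

```python
from collections import Counter

def drop_most_common(tokens, n):
    """Return the distinct tokens, sorted, excluding the n most frequent ones."""
    counts = Counter(tokens)
    too_common = {word for word, _ in counts.most_common(n)}
    return sorted(set(counts) - too_common)

# Hypothetical sample: "the" occurs 4 times, "and" twice, the rest once.
tokens = "the cat saw the dog and the bird and the fish".split()
print(drop_most_common(tokens, 2))
```

On a single short article the most frequent words are not always the uninteresting ones, so a fixed stop list may give steadier results than a frequency cutoff.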
Upvotes: 0