markwk

Reputation: 135

Parsing Text with PHP / Python? How? With What?

I'm building a Drupal education site for foreign language learners that has a vocabulary-sharing function and a flashcard feature. I'm thinking of adding a way to parse texts (newspaper articles and such) and output a list of the words used, then perhaps cross-link them to the vocabulary section.

For now, I'm wondering whether there are any programs or scripts, ideally in PHP or possibly Python, that could parse a text into a list of the words used (and, if possible, exclude a list of the most common words). I'd like to adapt it to work within Drupal, so PHP would be best, but I'm open to using any of the tools out there. Any ideas?

I'm not really sure where to even start on this one.

Upvotes: 1

Views: 565

Answers (4)

jrn.ak

Reputation: 36627

Simplistic start:

<?php
    // source text
    $paragraph = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        Proin congue, quam nec tincidunt congue, massa ipsum sodales tellus,
        in rhoncus sem quam quis ante. Nam condimentum pellentesque libero at
        blandit. Suspendisse felis sem, interdum pulvinar ultricies a, auctor
        vel leo. Curabitur congue mi nec purus placerat sit amet mollis magna
        laoreet. Duis eu purus non turpis lacinia sagittis. Aliquam tristique
        nulla volutpat neque posuere faucibus. Aenean tempus diam quis sem
        convallis id cursus lorem sagittis. Nam feugiat, felis nec tincidunt
        aliquet, felis lectus bibendum mi, ut tincidunt purus urna ac felis.
        Quisque ut lectus dolor. Duis ipsum arcu, adipiscing id vestibulum
        fringilla, euismod non augue. Nullam quis ipsum nec tortor tristique
        egestas sed nec leo. Pellentesque tempus velit lacus, sit amet rhoncus
        mi. Curabitur justo ipsum, consectetur ac vestibulum sed, porttitor
        eget dui. Vivamus nisi lorem, porta vel gravida quis, varius et elit.
        Nulla eros metus, congue sit amet interdum at, porta eget ligula.";

    // replace newlines with spaces so words on adjacent lines are not joined
    $paragraph = str_replace(array("\r","\n"), ' ', $paragraph);

    // convert to lowercase
    $paragraph = strtolower($paragraph);

    // remove non-alphanumeric characters
    $paragraph = preg_replace('/[^A-Za-z0-9\s]/', '', $paragraph);

    // convert into array, splitting on any run of whitespace (spaces, tabs)
    $words = preg_split('/\s+/', $paragraph, -1, PREG_SPLIT_NO_EMPTY);

    // remove null values
    $words = array_filter($words, 'strlen');

    // remove duplicate values
    $words = array_unique($words);

    // sort array alphabetically (optional)
    natsort($words);

    // reindex array
    $words = array_values($words);

    // display array
    print_r($words);
?>

Update: Now removes newlines. Separated all modifications into individual commands.
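
For comparison, here is a minimal Python sketch of the same steps (lowercase, strip non-alphanumerics, split, de-duplicate, sort). The function name unique_words is made up for illustration. Like the PHP above, it only keeps ASCII letters and digits, so accented characters would be dropped, and it uses a plain lexicographic sort rather than natsort's natural ordering.

```python
import re


def unique_words(text):
    """Return the sorted, de-duplicated lowercase words in text.

    Hypothetical helper mirroring the PHP steps above; ASCII-only.
    """
    # Convert to lowercase
    text = text.lower()
    # Drop anything that is not a lowercase letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # split() with no arguments splits on any run of whitespace
    # (including newlines) and never yields empty strings
    words = text.split()
    # De-duplicate and sort
    return sorted(set(words))


print(unique_words("Lorem ipsum dolor sit amet, lorem IPSUM."))
# prints ['amet', 'dolor', 'ipsum', 'lorem', 'sit']
```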

Upvotes: 2

philonous

Reputation: 4071

To exclude very common words, you can use a list of stop words, for instance the english.stop list fetched in the code below.

You could load the stop words and remove them from your set of words (a set difference, which is what array_diff computes):

<?php

// read in stop words
$stopwords = file('ftp://ftp.cs.cornell.edu/pub/smart/english.stop', FILE_IGNORE_NEW_LINES);

// read in the words from your text
$words_from_text = array("notfrequent", "notfrequenttoo", "a", "is", "the", "superspecialword");

// remove the stop words
$final_words = array_diff($words_from_text, $stopwords);

// and have a look
var_dump($final_words);

?>
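
A similar filter in Python, as a rough sketch. The tiny STOP_WORDS set and the function name remove_stop_words are made-up stand-ins; in practice you would probably load a full stop word list from a file, one word per line.

```python
# Hypothetical stand-in for a real stop word list
STOP_WORDS = {"a", "is", "the"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words that are not stop words, preserving their order."""
    return [word for word in words if word not in stop_words]


words_from_text = ["notfrequent", "notfrequenttoo", "a", "is", "the", "superspecialword"]
print(remove_stop_words(words_from_text))
# prints ['notfrequent', 'notfrequenttoo', 'superspecialword']
```

Keeping the stop words in a set makes each membership check cheap, which matters once the list has a few hundred entries.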

Upvotes: 1

hnprashanth

Reputation: 831

You can use PHP's built-in file functions to read the file: http://www.w3schools.com/PHP/php_file.asp
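
If you end up on the Python side instead, reading the article into a string might look like the sketch below. The helper name read_text and the file name in the usage comment are assumptions for illustration, and UTF-8 is only a guess at the file's encoding.

```python
from pathlib import Path


def read_text(path, encoding="utf-8"):
    """Read a whole text file into a single string.

    Hypothetical helper; the encoding default is an assumption.
    """
    return Path(path).read_text(encoding=encoding)


# Usage (article.txt is a placeholder file name):
# text = read_text("article.txt")
```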

Upvotes: 1

nmichaels

Reputation: 51029

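If the text of your article is in a string named text:

from collections import Counter

# Lowercased tokens, keeping only the purely alphabetic ones:
tokens = [word.lower() for word in text.split() if word.isalpha()]
# Get the set of words used in the text:
words = set(tokens)
# Get word counts (whole words, not substring matches as text.count() would give):
frequencies = Counter(tokens)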

You can drop the most common words from the set pretty easily with that. It might be worthwhile to strip away punctuation instead of just filtering with isalpha(), which discards any word that has punctuation attached (such as the last word of a sentence).
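
One way that could look, as a sketch only: strip ASCII punctuation from the ends of each token before the isalpha() check, count whole words, and then drop the n most frequent ones. The function names and the sample sentence are invented for illustration. Dropping a text's own most frequent words is only a rough substitute for a proper stop word list.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Count whole words after stripping surrounding ASCII punctuation."""
    tokens = (word.strip(string.punctuation).lower() for word in text.split())
    # isalpha() is False for empty strings, so punctuation-only tokens vanish
    return Counter(token for token in tokens if token.isalpha())


def without_most_common(frequencies, n):
    """Return the set of words minus the n most frequent ones."""
    common = {word for word, _ in frequencies.most_common(n)}
    return set(frequencies) - common


text = "The cat sat. The cat ran! The end."
frequencies = word_frequencies(text)
print(frequencies["the"])  # prints 3
print(sorted(without_most_common(frequencies, 2)))  # prints ['end', 'ran', 'sat']
```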

Upvotes: 0
