Arthurrrrrr

Reputation: 3

Removing words from python lists?

I am a complete noob in Python and web scraping and have run into some issues quite early. I have been able to scrape the titles from a Dutch news website and split them into words. Now my objective is to remove certain words from the results. For instance, I don't want words like "het" and "om" in the list. Does anyone know how I can do this? (I'm using Python requests and BeautifulSoup.)

import requests
from bs4 import BeautifulSoup

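url = "http://www.nu.nl"
r = requests.get(url)

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(r.content, "html.parser")

# Every headline sits in a <span class="title"> element.
g_data = soup.find_all("span", {"class": "title"})

for item in g_data:
    # print(...) with a single argument works on both Python 2 and Python 3.
    print(item.text.split())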

 

Upvotes: 0

Views: 447

Answers (1)

Sam King

Reputation: 2188

In natural language processing, common words that are filtered out like this are called "stop words".

Do you want to preserve the order and count of each word, or do you just want the set of words that appear on the page?

If you just want the set of words that appear on the page, using sets is probably the way to go. Something like the following might work:

# It's probably more common to define your STOP_WORDS in a file and then read
# them into your data structure to keep things simple for large numbers of those
# words.
STOP_WORDS = set([
    'het',
    'om'
])

all_words = set()
for item in g_data:
    all_words |= set(item.text.split())
all_words -= STOP_WORDS
print(all_words)
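As the comment at the top of that snippet suggests, a longer stop-word list is easier to maintain in a file. Here is a minimal sketch of reading one, assuming a plain UTF-8 text file with one word per line. The function name load_stop_words and the file name stopwords_nl.txt are made up for illustration.

```python
import io


def load_stop_words(path):
    """Read one stop word per line from `path` and return them as a set."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    with io.open(path, encoding="utf-8") as handle:
        # Skip blank lines and lowercase each word so lookups are consistent.
        return {line.strip().lower() for line in handle if line.strip()}


# Hypothetical usage; the file name is an assumption:
# STOP_WORDS = load_stop_words("stopwords_nl.txt")
```

Because the words are lowercased on load, you would probably want to lowercase the scraped words too before comparing them.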

If, on the other hand, you care about the order, you could just refrain from adding stop words to your list.

words_in_order = []
for item in g_data:
    words_from_span = item.text.split()
    # You might want to break this out into its own function for modularity.
    for word in words_from_span:
        if word not in STOP_WORDS:
            words_in_order.append(word)
print(words_in_order)
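One thing to watch for: `word not in STOP_WORDS` is an exact, case-sensitive match, so "Het" at the start of a title or "om," with a trailing comma would slip through. Below is a rough sketch of the order-preserving loop broken out into its own function, with the comparison done on a normalized copy of each word. The name remove_stop_words and the sample title are invented for the example, and dropping punctuation-only tokens is my own assumption about what you would want.

```python
import string


def remove_stop_words(words, stop_words):
    """Return `words` in their original order, minus anything in `stop_words`."""
    kept = []
    for word in words:
        # Normalize only for the comparison; the original word is what we keep.
        normalized = word.strip(string.punctuation).lower()
        # Tokens that were nothing but punctuation normalize to "" and are dropped.
        if normalized and normalized not in stop_words:
            kept.append(word)
    return kept


title = "Het kabinet wil om de regels praten"  # made-up sample title
print(remove_stop_words(title.split(), {"het", "om"}))
# prints ['kabinet', 'wil', 'de', 'regels', 'praten']
```

In the scraping loop you would call it as remove_stop_words(item.text.split(), STOP_WORDS) for each span.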

If you don't care about order but you want frequency, you could create a dict (or defaultdict for convenience) of words to counts.

from collections import defaultdict
word_counts = defaultdict(int)
for item in g_data:
    # You might want to break this out into its own function for modularity.
    for word in item.text.split():
        if word not in STOP_WORDS:
            word_counts[word] += 1
# .items() works on both Python 2 and Python 3; .iteritems() is Python 2 only.
for word, count in word_counts.items():
    print('%s: %d' % (word, count))
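If you are on Python 2.7 or later, collections.Counter does the same counting as the defaultdict version and can also sort by frequency for you. A small sketch, using a hard-coded list of made-up titles in place of the scraped g_data:

```python
from collections import Counter

STOP_WORDS = {"het", "om"}

# Made-up sample titles standing in for the scraped spans.
titles = ["het weer om acht uur", "weer file om Utrecht"]

# Count every word that is not a stop word.
word_counts = Counter(
    word
    for title in titles
    for word in title.split()
    if word not in STOP_WORDS
)

# most_common(n) returns the n most frequent (word, count) pairs.
for word, count in word_counts.most_common(3):
    print("%s: %d" % (word, count))
```

With the real data you would iterate over item.text.split() for each item in g_data instead of the sample list.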

Upvotes: 1
