Ulvi Bajarani

Reputation: 13

Parsing HTML files in the same directory in Python

I have written this code to parse HTML files:

from bs4 import BeautifulSoup
import re
import os
from os.path import join

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                clean_tokens = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']

                FinalResult = set()
                for token in clean_tokens:
                    if token not in removementWords:
                        FinalResult.add(token)

I have struggled with these cases:

1) It saves the results in a separate list for each file, while I need one list with all the results from the various files;

2) As a result, I cannot remove the duplicates across the different files.

How can I handle them?

Upvotes: 0

Views: 77

Answers (1)

Metalgear

Reputation: 3457

I think I found where you went wrong. Here's your code, changed a little bit.

from bs4 import BeautifulSoup
import re
import os
from os.path import join

# Define the set here, outside the loop, so it collects results from all files into one.
FinalResult = set()

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                clean_tokens = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']

                # FinalResult = set() - defining it here was wrong: the set
                # was recreated, and thus emptied, for every file.
                for token in clean_tokens:
                    if token not in removementWords:
                        FinalResult.add(token)

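The fix can also be sketched as a pair of small helper functions, so the cross-file deduplication can be checked without touching BeautifulSoup at all. The names `clean_tokens` and `merge_unique` are illustrative, not from the original post; the filters are the same as in the question.

```python
import re

# Same filters as in the question: drop tokens containing digits or
# punctuation, and drop a small list of unwanted words.
STOP_WORDS = {'here', 'than'}
WORD_RE = re.compile(r'[^\W\d]*$')  # purely alphabetic tokens

def clean_tokens(words):
    """Keep alphabetic tokens that are not in STOP_WORDS."""
    return [t for t in words if WORD_RE.match(t) and t not in STOP_WORDS]

def merge_unique(token_lists):
    """Accumulate tokens from every file into one deduplicated set.

    The set is created once, before the loop - creating it inside the
    loop would reset it for every file, which was the original bug.
    """
    result = set()
    for tokens in token_lists:
        result.update(clean_tokens(tokens))
    return result
```

In the walk loop, each file's `MediumText` would be one entry in `token_lists`; `set.update` replaces the manual `add` loop.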
Upvotes: 1

Related Questions