haben

Reputation: 65

Python - UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 229393: character maps to <undefined>

I have tried to scrape a website using Python and Selenium.

Here is the partial code:

    def data_html_text(self): #Downloads page source code
        Xyz_page_source = self.driver.page_source
        with open(self.Html_source, 'w', encoding="utf-8") as file:
            file.write(Xyz_page_source)


    def email_parser(self): # gets scraped links and filters it 
        count = 0

        file = open(self.Html_source)
        data = file.read()
        soup = BeautifulSoup(data, 'lxml')
        all_divs = soup.find_all('li',class_='badgeList__item',)
        scrapper_links = [self.Base_url + a_href.div.div.a['href'] for a_href in all_divs]

        for link in scrapper_links:
            count += 1
            print("{} ------> {}".format(count,link))

        count = 0

        data = []
        for s_link in scrapper_links:
            user_page = requests.get(s_link, headers=self.headers)
            text = user_page.content
            inner_pagee = text.decode()
            all_emails = re.findall(r'[w\w.-]+@[\w\.-]+', inner_pagee)
            if all_emails:
                count += 1
                print("{} Scraping Emails: {}".format(count, all_emails[0]))
                data.append(all_emails[0])
                new_data = list(set(data))

        data1 =[]
        for x in new_data:
            x = re.sub('[.]$','',x)
            data1.append(x)
        print(data1)


        with open('test.csv', "w", encoding="utf-8") as output:
            writer = csv.writer(output, lineterminator='\n')
            for val in data1:
                writer.writerow([val])

But I keep getting the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 229393: character maps to <undefined>

Any idea on how to solve this?

Upvotes: 1

Views: 4760

Answers (1)

Kristie

Reputation: 132

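The 'charmap' codec in the error message means the file is being decoded with the platform default encoding (typically cp1252 on Windows) rather than UTF-8. That happens because open(self.Html_source) in email_parser is called without an encoding argument, even though the file was written as UTF-8. Check which encoding the file was written with and pass that same encoding when you open it for reading.

Try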

  encoding='utf-8-sig'
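
To illustrate, here is a minimal sketch of the write/read round trip, assuming the file was written as UTF-8 as in data_html_text. The helper read_html, the sample string, and the temporary file are hypothetical and only stand in for self.Html_source. The sample contains "Á", whose UTF-8 bytes are 0xC3 0x81, so decoding it as cp1252 should hit the same undefined byte 0x81 as in the question.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the platform default."""
    with open(path, "r", encoding=encoding) as fh:
        return fh.read()


# Hypothetical sample: "Á" (U+00C1) is stored in UTF-8 as the bytes 0xC3 0x81.
sample = "<li class='badgeList__item'>Á</li>"

fd, tmp_path = tempfile.mkstemp(suffix=".html")
os.close(fd)
try:
    # Write as UTF-8, mirroring what data_html_text does.
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(sample)

    # Reading with the same encoding round-trips cleanly.
    round_trip = read_html(tmp_path)

    # utf-8-sig also works; it additionally skips a leading BOM if one is present.
    round_trip_sig = read_html(tmp_path, encoding="utf-8-sig")

    # Forcing cp1252 reproduces the 'charmap' failure, because 0x81 is unmapped there.
    try:
        read_html(tmp_path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
finally:
    os.remove(tmp_path)
```

In the question's code, the equivalent change would be to pass encoding="utf-8" (or "utf-8-sig") to the open call in email_parser, ideally inside a with block so the file is closed afterwards.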

Upvotes: 1
