Getting Polish characters in Python

Question

Using the code below, I extracted the table and saved it as a CSV file:

import pandas as pd
x = pd.read_html('https://www.ebmia.pl/lozyska-kulkowe-zwykle-seria-c-196_140_1328_1282_3375.html')[0]
x.to_csv('file.csv')

When I do this, Polish characters such as "Ł" are converted to "?" in the CSV file.

How can I keep the original Polish characters in the CSV file?

jezrael · Accepted Answer

I think you need the parameter encoding='utf-8'. To get the first parsed table, add [0], because read_html returns a list of DataFrames:

import pandas as pd

url = 'https://www.ebmia.pl/lozyska-kulkowe-zwykle-seria-c-196_140_1328_1282_3375.html'
df = pd.read_html(url, encoding='utf-8')[0]
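To illustrate why the encoding matters, here is a small sketch using only the standard library. It is not taken from the page in the question; the sample string is just an example, and I am only assuming that a wrong or lossy codec is what produced the "?" there:

```python
# Sample text containing Polish letters (illustrative only).
text = "Łożysko kulkowe zwykłe"

# The bytes as they would travel over the wire if the page is UTF-8.
raw = text.encode("utf-8")

# Decoding with the matching codec restores the original text.
right = raw.decode("utf-8")

# Decoding with a different codec (latin-1 here) silently produces garbage.
wrong = raw.decode("latin-1")

# Encoding to a codec that lacks these letters, with errors="replace",
# swaps each unsupported character for "?".
lossy = text.encode("ascii", errors="replace")

print(right == text)
print(wrong == text)
print(lossy)
```

Passing encoding='utf-8' to read_html tells pandas which codec to use instead of leaving it to be guessed.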

Some data cleaning:

# drop column level 1 (the second header row, which holds the filters from the HTML)
df.columns = df.columns.droplevel(1)
# fill NaN values in the Zdjęcie column by forward filling
df['Zdjęcie'] = df['Zdjęcie'].ffill()
# drop rows that have NaN in the 'Wewnętrzny mm ' column
df = df.dropna(subset=['Wewnętrzny mm '])
print(df.head())
                                              Zdjęcie Oznaczenie  ⇓  \
4   Łożysko kulkowe zwykłe 16001 NSK - (symbol: L0...         16001   
7   Łożysko kulkowe zwykłe 16001 ZZ FAG - (symbol:...      16001 2Z   
10  Łożysko kulkowe zwykłe 16002.SKF - (symbol: L0...         16002   
13  Łożysko kulkowe zwykłe 16002-A.FAG - (symbol: ...         16002   
16  Łożysko kulkowe zwykłe 16002 - (symbol: L0101-...         16002   

    Wewnętrzny mm   Zewnętrzny mm   Szerokość / wysokość mm  Zabudowa  Luz   \
4           1200.0          2800.0                     700.0         -    -   
7           1200.0          2800.0                     700.0        2Z    -   
10          1500.0          3200.0                     800.0         -    -   
13          1500.0          3200.0                     800.0         -    -   
16          1500.0          3200.0                     800.0         -    -   

                   Producent Cena(brutto)  
4   NSK BEARINGS POLSKA S.A.     22,14 zł  
7                        NaN          NaN  
10                       NaN     31,34 zł  
13                       NaN     17,11 zł  
16                       NaN      5,40 zł
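If you want to try the cleaning steps without downloading the page, here is a minimal sketch on a made-up frame. The two-level header, the "filter" label and the row values are assumptions that only mimic the shape of the scraped table:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped table: the second header level
# plays the role of the filter row from the HTML.
columns = pd.MultiIndex.from_tuples([
    ("Zdjęcie", "filter"),
    ("Wewnętrzny mm ", "filter"),
])
toy = pd.DataFrame(
    [
        ["Łożysko kulkowe zwykłe 16001", np.nan],
        [np.nan, 12.0],
        ["Łożysko kulkowe zwykłe 16002", np.nan],
        [np.nan, 15.0],
    ],
    columns=columns,
)

# Drop column level 1 so only the real column names remain.
toy.columns = toy.columns.droplevel(1)
# Carry each description down to the rows below it.
toy["Zdjęcie"] = toy["Zdjęcie"].ffill()
# Keep only the rows that have a value in 'Wewnętrzny mm '.
toy = toy.dropna(subset=["Wewnętrzny mm "])
print(toy)
```

Only the rows that carry a measurement survive, and each one now has its description filled in.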

If you need to write the output to CSV:

df.to_csv('file.csv', encoding='utf-8', index=False)

The same parameter works for read_csv:

df = pd.read_csv('file.csv', encoding='utf-8')
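A quick way to check that the Polish characters survive writing and reading is a round trip on a tiny frame. This is only a sketch: the sample row and the temporary path are made up for the test and are not the scraped data:

```python
import os
import tempfile

import pandas as pd

# Small sample frame with Polish characters (illustrative values).
sample = pd.DataFrame({
    "Zdjęcie": ["Łożysko kulkowe zwykłe 16001 NSK"],
    "Cena(brutto)": ["22,14 zł"],
})

# Write to a temporary location so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "file.csv")
sample.to_csv(path, encoding="utf-8", index=False)

# Read it back with the same encoding.
restored = pd.read_csv(path, encoding="utf-8")

# Compare column names and values with the original.
same_columns = list(restored.columns) == list(sample.columns)
same_values = restored.values.tolist() == sample.values.tolist()
print(same_columns and same_values)
```

If the comparison fails, the encoding used for writing and the one used for reading most likely differ.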
