Reputation: 4429
I'm having trouble with the .contains function on this df. Why doesn't it match my string? The df clearly has the string, and matching "Chief" alone works.
import pandas as pd
link = 'https://www.sec.gov/Archives/edgar/data/1448056/000119312518215760/d619223ddef14a.htm'
ceo = 'Chief Executive Officer'
df_list = pd.read_html(link)
df = df_list[62]
df = df.fillna('')
for column in df:
    if column == 4:
        print('try #1', df[column].str.contains(ceo, case=True, regex=True))
        print('try #2', df[column].str.contains(ceo, case=True, regex=False))
        print('try #3', df[column].str.contains(ceo, regex=False))
        print('try #4', df[column].str.contains(ceo, regex=True))
        print('try #5', df[column].str.contains(pat=ceo, regex=False))
        print('try #6', df[column].str.contains(pat=ceo, case=True, regex=True))
Upvotes: 0
Views: 128
Reputation: 1196
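The problem is the encoding: the words in the cell are separated by non-breaking spaces (\xa0), not regular spaces, so the search string never matches. You can see it if you do: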
df[4].iloc[2]
It prints:
'Founder,\xa0Chief\xa0Executive\xa0Officer,\xa0and\xa0Director'
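The mismatch can be reproduced without downloading the filing. This is a minimal sketch that reuses the string printed above and the `ceo` value from the question; the name `odd_spaces` is only for illustration.

```python
# The cell value exactly as printed above, and the search string from the question.
cell = 'Founder,\xa0Chief\xa0Executive\xa0Officer,\xa0and\xa0Director'
ceo = 'Chief Executive Officer'

# The multi-word search fails because the words are joined by '\xa0', not ' '.
print(ceo in cell)      # False
# A single word has no space in it, so it still matches.
print('Chief' in cell)  # True

# List every whitespace character in the cell that is not a plain space.
odd_spaces = sorted({hex(ord(ch)) for ch in cell if ch.isspace() and ch != ' '})
print(odd_spaces)       # ['0xa0']
```

This is why "Chief" alone matches while the full title does not.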
To fix it, use unidecode, which transliterates each non-breaking space to a regular ASCII space:
import unidecode
for column in df.columns:
    if column == 4:
        # Transliterate every cell once; '\xa0' becomes a regular space
        cleaned = df[column].apply(unidecode.unidecode)
        print('try #1', cleaned.str.contains(ceo, case=True, regex=True))
        print('try #2', cleaned.str.contains(ceo, case=True, regex=False))
        print('try #3', cleaned.str.contains(ceo, regex=False))
        print('try #4', cleaned.str.contains(ceo, regex=True))
        print('try #5', cleaned.str.contains(pat=ceo, regex=False))
        print('try #6', cleaned.str.contains(pat=ceo, case=True, regex=True))
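If you would rather not install an extra package, pandas' own string methods may be enough. The sketch below is an alternative to unidecode, not part of the fix above. It assumes a small hand-made Series `col` standing in for `df[4]`; the 'President' row is invented for illustration, and only the last row comes from the real table.

```python
import pandas as pd

ceo = 'Chief Executive Officer'

# Hypothetical stand-in for df[4]; only the last value is from the real table.
col = pd.Series([
    '',
    'President',
    'Founder,\xa0Chief\xa0Executive\xa0Officer,\xa0and\xa0Director',
])

# Option 1: replace non-breaking spaces with regular spaces, then search.
cleaned = col.str.replace('\xa0', ' ', regex=False)
by_replace = cleaned.str.contains(ceo, regex=False)

# Option 2: NFKC normalization also maps '\xa0' to ' ' (and folds other
# compatibility characters, which may or may not be what you want).
by_normalize = col.str.normalize('NFKC').str.contains(ceo, regex=False)

print(by_replace.tolist())    # [False, False, True]
print(by_normalize.tolist())  # [False, False, True]
```

Option 1 touches only the non-breaking spaces, so it is the more conservative choice if the rest of the text should stay exactly as scraped.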
Upvotes: 1