Reinaldo Chaves
Reinaldo Chaves

Reputation: 995

In pandas, how to select phrases in a dataframe from word list or word set?

In Python3 and pandas I have the dataframe:

df_projetos_api_final.info()                                                      
<class 'pandas.core.frame.DataFrame'>                                                      
Int64Index: 93631 entries, 1 to 93667                                                      
Data columns (total 21 columns):                                                           
AnoMateria                       93631 non-null object                                     
CodigoMateria                    93631 non-null object                                     
DescricaoIdentificacaoMateria    93631 non-null object                                     
DescricaoSubtipoMateria          93631 non-null object                                     
IndicadorTramitando              93631 non-null object                                     
NomeCasaIdentificacaoMateria     93631 non-null object                                     
NumeroMateria                    93631 non-null object                                     
ApelidoMateria                   891 non-null object                                       
DataApresentacao                 93631 non-null object                                     
DataLeitura                      54213 non-null object                                     
EmentaMateria                    93631 non-null object                                     
ExplicacaoEmentaMateria          9461 non-null object                                      
IndicadorComplementar            93631 non-null object                                     
DescricaoNatureza                54352 non-null object                                     
NomeAutor                        93100 non-null object                                     
IndicadorOutrosAutores           93214 non-null object                                     
CodigoParlamentar                49786 non-null object                                     
NomeParlamentar                  49786 non-null object                                     
NomeCompletoParlamentar          49786 non-null object                                     
UfParlamentar                    45613 non-null object                                     
DescricaoSituacao                78783 non-null object                                     
dtypes: object(21)                                                                         
memory usage: 8.2+ MB 

The column "EmentaMateria" has in each row a series of sentences. I plan to create a new dataframe from rows that contain any or several of these words (or group of words) in this column:

str_choice = "MULHER|MULHERES|TRABALHO DOMESTICO|VIOLENCIA CONTRA A MULHER|VIOLENCIA DOMESTICA|VIOLENCIA DE GENERO|MARIA DA PENHA|ABORTO|ABORTAMENTO|INTERRUPCAO DE GRAVIDEZ|INTERRUPCAO DE GESTACAO|DIREITO REPRODUTIVO|DIREITOS REPRODUTIVOS|DIREITO A VIDA|CONCEPCAO|CONTRACEPCAO|CONTRACEPTIVO|MISOPROSTOL|MIFEPRISTONE|CYTOTEC|UTERO|GESTACAO|GRAVIDEZ|PARTO|VIOLENCIA OBSTETRICA|FETO|BEBE|CRIANCA|VIOLENCIA SEXUAL|FEMINICIDIO|MORTE DE MULHER|MORTE DE MULHERES|HOMICIDIO DE MULHER|HOMICIDIO DE MULHERES|ASSEDIO SEXUAL|ASSEDIO|ESTUPRO|VIOLENCIA SEXUAL|ABUSO SEXUAL|ESTUPRO DE VULNERAVEL|LICENCA MATERNIDADE|FEMININO|MULHER NEGRA|MULHERES NEGRAS|MULHERES QUILOMBOLAS|MULHERES INDIGENAS|NEGRAS|NEGRA|RACISMO|RACA|RACIAL|ABUSO SEXUAL|MATERNIDADE|MAE|AMAMENTACAO|SEXUALIDADE|SEXO|GENERO|FEMINISMO|MACHISMO|GUARDA DE FILHOS|GUARDA DOS FILHOS|IGUALDADE DE GENERO|IDENTIDADE DE GENERO|IDEOLOGIA DE GENERO|EDUCACAO SEXUAL|ESCOLA SEM PARTIDO|TRANSEXUAL|TRANSEXUALIDADE|MULHER TRANS|MULHERES TRANS|MUDANCA DE SEXO|READEQUACAO SEXUAL|EXPLORACAO SEXUAL|PROSTITUICAO|ORIENTACAO SEXUAL|HOMOSSEXUAL|HOMOSSEXUALIDADE|HOMOSSEXUALISMO|LESBICA|LESBICAS|DIREITO DOS HOMENS|EDUCACAO RELIGIOSA|DEUS|RELIGIAO|EDUCACAO DOMICILIAR|HOMESCHOOLING|CRECHE|EDUCACAO INFANTIL|CASAMENTO INFANTIL"  

So I did like this:

seleciona2 = df_projetos_api_final [df_projetos_api_final['EmentaMateria'].\
                               str.contains(str_choice, na=False)]

The newly generated dataframe has collected multiple sentences that have one or more of these words. However, many lines do not have these words, as
"ENCAMINHA AO SENADO FEDERAL, UM ADENDO AS SUGESTOES DE EMENDAS A
PROPOSTA ORCAMENTARIA DO DISTRITO FEDERAL, REFERENTE A ALTERACAO DO
PROGRAMA DE TRABALHO DO FUNDEPE - FUNDO DE DESENVOLVIMENTO DO
DISTRITO FEDERAL, VISANDO A ACRESCENTAR MAIS CZ 3.453.977.000,00
(TRES BILHOES, QUATROCENTOS E CINQUENTA E TRES MILHOES, NOVECENTOS E
SETENTA E SETE MIL CRUZADOS) AO PROJETO DE EXECUCAO DE OBRAS E
EQUIPAMENTOS DO SISTEMA DE EDUCACAO E CULTURA."

Please, is this because similar words are also being searched for in sentences? Or because many sentences have too much white space or line breaks between some words?

Edit 7/12/2019

Thanks so much all friends for your attention. After you wrote I went back to reviewing the database and again the suggested codes. I got the original database, with accents in Brazilian Portuguese. I think this is the original problem - I didn't know the original base had been changed

I found that the database I was working on had gone through unidecode to remove accents from Portuguese. So I repeated the tests with a str_choice with accents and with the original database, and then it worked - I haven't checked all the lines yet, but all I've seen are correct so far

So the new str_choice (I used the name search_list), I used was this:

df_projetos_api_final['EmentaMateria'] = df_projetos_api_final['EmentaMateria'].str.upper()
search_list = ["MULHER", "MULHERES", "TRABALHO DOMÉSTICO", "VIOLÊNCIA CONTRA A MULHER", "VIOLÊNCIA DOMÉSTICA", "VIOLÊNCIA DE GÊNERO", "MARIA DA PENHA", "ABORTO", "ABORTAMENTO", "INTERRUPÇÃO DE GRAVIDEZ", "INTERRUPÇÃO DE GESTAÇÃO", "DIREITO REPRODUTIVO", "DIREITOS REPRODUTIVOS", "DIREITO À VIDA", "CONCEPÇÃO", "CONTRACEPÇÃO", "CONTRACEPTIVO", "MISOPROSTOL", "MIFEPRISTONE", "CYTOTEC", "ÚTERO", "GESTAÇÃO", "GRAVIDEZ", "PARTO", "VIOLÊNCIA OBSTÉTRICA", "FETO", "BEBÊ", "CRIANÇA", "VIOLÊNCIA SEXUAL", "FEMINICÍDIO", "MORTE DE MULHER", "MORTE DE MULHERES", "HOMICÍDIO DE MULHER", "HOMICÍDIO DE MULHERES", "ASSÉDIO SEXUAL", "ASSÉDIO", "ESTUPRO", "VIOLÊNCIA SEXUAL", "ABUSO SEXUAL", "ESTUPRO DE VULNERÁVEL", "LICENÇA MATERNIDADE", "FEMININO", "MULHER NEGRA", "MULHERES NEGRAS", "MULHERES QUILOMBOLAS", "MULHERES INDÍGENAS", "NEGRAS", "NEGRA", "RACISMO", "RAÇA", "RACIAL", "ABUSO SEXUAL", "MATERNIDADE", "MÃE", "AMAMENTAÇÃO", "SEXUALIDADE", "SEXO", "GÊNERO", "FEMINISMO", "MACHISMO", "GUARDA DE FILHOS", "GUARDA DOS FILHOS", "IGUALDADE DE GÊNERO", "IDENTIDADE DE GÊNERO", "IDEOLOGIA DE GÊNERO", "EDUCAÇÃO SEXUAL", "ESCOLA SEM PARTIDO", "TRANSEXUAL", "TRANSEXUALIDADE", "MULHER TRANS", "MULHERES TRANS", "MUDANÇA DE SEXO", "READEQUAÇÃO SEXUAL", "EXPLORAÇÃO SEXUAL", "PROSTITUIÇÃO", "ORIENTAÇÃO SEXUAL", "HOMOSSEXUAL", "HOMOSSEXUALIDADE", "HOMOSSEXUALISMO",  "LÉSBICA",  "LÉSBICAS",  "DIREITO DOS HOMENS", "EDUCAÇÃO RELIGIOSA",  "DEUS", "RELIGIÃO", "EDUCACÃO DOMICILIAR", "HOMESCHOOLING", "CRECHE",  "EDUCAÇÃO INFANTIL",  "CASAMENTO INFANTIL"]
mask = df_projetos_api_final['EmentaMateria'].str.contains('|'.join(search_list))
seleciona = df_projetos_api_final[mask]
seleciona.info()

Upvotes: 5

Views: 4241

Answers (4)

Anthony R
Anthony R

Reputation: 2939

Possible Solution:

"Create a new dataframe from rows that contain any or several of these words (or group of words) in this column:"


Step 1: Split the column into a new DataFrame

# Assume that each 'sentence' is comma-separated.
df_split = df_projetos_api_final.EmentaMateria.split(",", expand=True)

Step 2: Search for sub-string

# Set regex=False if you want the entire word to match vs a sub-string. 
rows_that_match = df_split[df_split.str.contains(str_choice, case=False)]

Upvotes: 0

AMC
AMC

Reputation: 2702

Diagnosis

First, let's take a look at why your code doesn't work. @jorijnsmit gave it away (and shared a useful answer), your regex matches characters regardless of where they are. Let us illustrate with a simpler example, which I will be using throughout:

We want to match the words 'app' and 'he', so we construct a regex much like yours.

strings_to_match = ['app', 'he']

match_pattern = '|'.join(strings_to_match) # "app|he"

We join the strings we want to match using the alternation operator and we're good to go, right? Thanks to the magic of regex101, here are the results of applying our pattern to a few strings (matches are in square brackets):

  • [he]llo
  • brot[he]r
  • [app]lication
  • [he]
  • [app]le
  • h[app]ier
  • [app]
  • [he]ll

Our pattern matches the strings 'app' and 'he' anywhere, when we want just the words themselves!


RegEx Solution

What can we do to fix this? Our first thought may be to change our pattern to ' app | he ', which does fix the issue with strings like 'application'. Unfortunately, this isn't foolproof. That pattern fails to recognize the word 'app' in 'I downloaded an app.', which is perfectly valid to us. Fortunately for us, regex has just the solution we need: Word boundaries, represented by the token '\b', whose principle is rather self-explanatory.

Here are some results for the new pattern, '\bapp\b|\bhe\b':

  • '[he]'
  • 'apple'
  • 'happier'
  • ' [app] '
  • 'hell'
  • 'I downloaded an [app]!'

Exactly what we expect! While it does work correctly, that pattern is needlessly difficult to read. We can use only one set of word boundary tokens by putting all our substrings into a non-capturing group: '\b(?:app|he)\b'. A capturing group, well, groups and captures a subset of a regex. In this case, the group would return the same thing as the entire match. A non-capturing group eliminates that redundancy while still allowing us to logically separate part of our expression.

Here is a complete program which demonstrates constructing the pattern and using it on a Pandas Series:

import pandas as pd

test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
             'I downloaded an app.']

test_series = pd.Series(data=test_strs)

strings_to_match = ['app', 'he']

match_pattern = fr"\b(?:{'|'.join(strings_to_match)})\b"  # "\b(?:app|he)\b"

match_res = test_series.str.contains(match_pattern, case=False)

Output of print(match_res):

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8     True
9     True
dtype: bool

A few notes on other solutions

1.

Please note that these methods can only match words, not arbitrary substrings. As such, they are not actually valid solutions for this particular problem, and are only covered here for the sake of completeness.

This is the same style as the solution by @FBruzzesi, which we will call version 1. For reference:

# Convert string into list of strings
str_list = str_choice.split(|)

# Control if any word is in the sentence after splitting the sentence by space
df['has_match'] = df.apply(lambda r: [x for x in str_list if x in r['EmentaMateria'].split(' ')], axis=1)

#This will create a list of words you find, then you can filter only those which has a match
df = df[df.apply(lambda r: len(r['has_match'])>0, axis=1)]

While their solution collects all the matches, we are only concerned with whether or not there is a match in. Let's look at a refactored version of their solution, version 2:

import pandas as pd

test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
             'I downloaded an app.']

test_series = pd.Series(data=test_strs)

strings_to_match = ['app', 'he']

series_split = test_series.str.split()

match_res = series_split.map(lambda curr_words: any((curr_sub in curr_words for curr_sub in strings_to_match)))

Unlike version 1, version 2 keeps the number of split() operations, which were a real cause for concern (I estimated around 8,000,000 split() operations for the entire column), to a minimum. It should also be more efficient since the number of iterations depends on the number of substrings to match, which should often be lower than the number of words in the string to check.

2.

I have seen a few mentions of the regex parameter which I feel are unclear or misleading. Yes, passing regex=False will match a literal string, no, just changing the parameter won't make your current code work (why would it?).


I hope this is the sort of thing you had in mind when you asked for a canonical answer. Let me know if anything is unclear or you have any further questions :)

Upvotes: 4

FBruzzesi
FBruzzesi

Reputation: 6485

For an exact match this workaround works:

# Convert string into list of strings
str_list = str_choice.split(|)

# Control if any word is in the sentence after splitting the sentence by space
df['has_match'] = df.apply(lambda r: [x for x in str_list if x in r['EmentaMateria'].split(' ')], axis=1)

#This will create a list of words you find, then you can filter only those which has a match
df = df[df.apply(lambda r: len(r['has_match'])>0, axis=1)]

Upvotes: 2

gosuto
gosuto

Reputation: 5741

The docs for .contains() mention you could use the stricter .match() instead as it is based on re.match instead of re.search.

For an explanation between the two see for example this thread: What is the difference between re.search and re.match?.

Edit: Just for kicks I tried finding out which pattern was matched exactly:

str_choice = "MULHER|MULHERES|TRABALHO DOMESTICO|VIOLENCIA CONTRA A MULHER|VIOLENCIA DOMESTICA|VIOLENCIA DE GENERO|MARIA DA PENHA|ABORTO|ABORTAMENTO|INTERRUPCAO DE GRAVIDEZ|INTERRUPCAO DE GESTACAO|DIREITO REPRODUTIVO|DIREITOS REPRODUTIVOS|DIREITO A VIDA|CONCEPCAO|CONTRACEPCAO|CONTRACEPTIVO|MISOPROSTOL|MIFEPRISTONE|CYTOTEC|UTERO|GESTACAO|GRAVIDEZ|PARTO|VIOLENCIA OBSTETRICA|FETO|BEBE|CRIANCA|VIOLENCIA SEXUAL|FEMINICIDIO|MORTE DE MULHER|MORTE DE MULHERES|HOMICIDIO DE MULHER|HOMICIDIO DE MULHERES|ASSEDIO SEXUAL|ASSEDIO|ESTUPRO|VIOLENCIA SEXUAL|ABUSO SEXUAL|ESTUPRO DE VULNERAVEL|LICENCA MATERNIDADE|FEMININO|MULHER NEGRA|MULHERES NEGRAS|MULHERES QUILOMBOLAS|MULHERES INDIGENAS|NEGRAS|NEGRA|RACISMO|RACA|RACIAL|ABUSO SEXUAL|MATERNIDADE|MAE|AMAMENTACAO|SEXUALIDADE|SEXO|GENERO|FEMINISMO|MACHISMO|GUARDA DE FILHOS|GUARDA DOS FILHOS|IGUALDADE DE GENERO|IDENTIDADE DE GENERO|IDEOLOGIA DE GENERO|EDUCACAO SEXUAL|ESCOLA SEM PARTIDO|TRANSEXUAL|TRANSEXUALIDADE|MULHER TRANS|MULHERES TRANS|MUDANCA DE SEXO|READEQUACAO SEXUAL|EXPLORACAO SEXUAL|PROSTITUICAO|ORIENTACAO SEXUAL|HOMOSSEXUAL|HOMOSSEXUALIDADE|HOMOSSEXUALISMO|LESBICA|LESBICAS|DIREITO DOS HOMENS|EDUCACAO RELIGIOSA|DEUS|RELIGIAO|EDUCACAO DOMICILIAR|HOMESCHOOLING|CRECHE|EDUCACAO INFANTIL|CASAMENTO INFANTIL"

df = pd.DataFrame(['ENCAMINHA AO SENADO FEDERAL, UM ADENDO AS SUGESTOES DE EMENDAS A PROPOSTA ORCAMENTARIA DO DISTRITO FEDERAL, REFERENTE A ALTERACAO DO PROGRAMA DE TRABALHO DO FUNDEPE - FUNDO DE DESENVOLVIMENTO DO DISTRITO FEDERAL, VISANDO A ACRESCENTAR MAIS CZ 3.453.977.000,00 (TRES BILHOES, QUATROCENTOS E CINQUENTA E TRES MILHOES, NOVECENTOS E SETENTA E SETE MIL CRUZADOS) AO PROJETO DE EXECUCAO DE OBRAS E EQUIPAMENTOS DO SISTEMA DE EDUCACAO E CULTURA.'.split()])

df.T[0][df.T[0].str.contains(str_choice)]

Returns:

18    ALTERACAO
Name: 0, dtype: object

This returns because it contains the substring 'RACA'. If you set regex=False this does not happen; it will look for the full string.

Upvotes: 3

Related Questions