Reputation: 995
In Python 3 with pandas, I have this DataFrame:
eleitos_d_doadores.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47490 entries, 0 to 47489
Data columns (total 21 columns):
uf_x 47490 non-null object
partido_eleicao_x 47490 non-null object
cargo_x 47490 non-null object
nome_completo_x 47490 non-null object
cpf 47490 non-null object
cpf_cnpj_doador 47490 non-null object
nome_doador 47490 non-null object
valor 47490 non-null object
tipo_receita 47490 non-null object
fonte_recurso 47490 non-null object
especie_recurso 47490 non-null object
cpf_cnpj_doador_originario 47490 non-null object
nome_doador_originario 47490 non-null object
tipo_doador_originario 47490 non-null object
Unnamed: 0 47490 non-null int64
uf_y 47490 non-null object
cargo_y 47490 non-null object
nome_completo_y 47490 non-null object
nome_urna 47490 non-null object
partido_eleicao_y 47490 non-null object
situacao 47490 non-null object
dtypes: int64(1), object(20)
memory usage: 8.0+ MB
I used this command to create a new column holding the first eight characters of the column "cpf_cnpj_doador":
eleitos_d_doadores['cnpj_raiz_doador'] = eleitos_d_doadores.cpf_cnpj_doador.str[:8]
This correctly truncated many of the rows: "01888360712" became "01888360".
But many rows were not truncated correctly. Instead of the expected value, the result was NaN: "50844182000155" became NaN (the correct value would be "50844182").
Does anyone know the origin of the NaN content?
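One way to investigate is to count the Python types stored in the column, since an object column can hold a mix of strings and numbers. The sketch below uses a tiny stand-in DataFrame named df (hypothetical, not the real data) built from values shown in this question, and it assumes that the entries without a leading zero were loaded as integers:

```python
import pandas as pd

# Hypothetical stand-in for eleitos_d_doadores. ASSUMPTION: values that
# kept their leading zero are str, the others were parsed as int.
df = pd.DataFrame(
    {"cpf_cnpj_doador": ["01888360712", 50844182000155, "00697656691", 1421697000137]}
)

# Count how many elements of the object column are of each Python type.
type_counts = df["cpf_cnpj_doador"].map(type).value_counts()
print(type_counts)

# .str[:8] slices only the str elements; non-string elements become NaN.
sliced = df["cpf_cnpj_doador"].str[:8]
print(sliced)
```

If the real column shows both str and int in the type counts, that would be consistent with the pattern in the sample, where only the values displayed with a leading zero were truncated.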
Here are the commands I wrote to create the columns, followed by a sample of the data that contains both failures and correct results:
eleitos_d_doadores['cnpj_raiz_doador'] = eleitos_d_doadores.cpf_cnpj_doador.str[:8]
eleitos_d_doadores['cnpj_raiz_doador_originario'] = eleitos_d_doadores.cpf_cnpj_doador_originario.str[:8]
eleitos_d_doadores.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47490 entries, 0 to 47489
Data columns (total 23 columns):
uf_x 47490 non-null object
partido_eleicao_x 47490 non-null object
cargo_x 47490 non-null object
nome_completo_x 47490 non-null object
cpf 47490 non-null object
cpf_cnpj_doador 47490 non-null object
nome_doador 47490 non-null object
valor 47490 non-null object
tipo_receita 47490 non-null object
fonte_recurso 47490 non-null object
especie_recurso 47490 non-null object
cpf_cnpj_doador_originario 47490 non-null object
nome_doador_originario 47490 non-null object
tipo_doador_originario 47490 non-null object
Unnamed: 0 47490 non-null int64
uf_y 47490 non-null object
cargo_y 47490 non-null object
nome_completo_y 47490 non-null object
nome_urna 47490 non-null object
partido_eleicao_y 47490 non-null object
situacao 47490 non-null object
cnpj_raiz_doador 3488 non-null object
cnpj_raiz_doador_originario 47490 non-null object
dtypes: int64(1), object(22)
memory usage: 8.7+ MB
nome = eleitos_d_doadores[(eleitos_d_doadores['nome_completo_x'] == 'JULIO CESAR DELGADO')]
nome.loc[:, ['cpf_cnpj_doador', 'cnpj_raiz_doador']]
cpf_cnpj_doador cnpj_raiz_doador
7390 1421697000137 NaN
7391 1421697000137 NaN
7392 1421697000137 NaN
7393 1421697000137 NaN
7394 56993900000131 NaN
7395 26198515000484 NaN
7396 26198515000484 NaN
7397 20574428000155 NaN
7398 12082605000158 NaN
7399 60892403000114 NaN
7400 17469701000177 NaN
7401 66460080000176 NaN
7402 21561725000129 NaN
7403 50844182000155 NaN
7404 3940864000181 NaN
7405 3940864000181 NaN
7406 3940864000181 NaN
7407 3940864000181 NaN
7408 3940864000181 NaN
7409 3940864000181 NaN
7410 3940864000181 NaN
7411 00697656691 00697656
7412 03776208660 03776208
7413 16760808649 NaN
7414 17153081000162 NaN
7415 20573722000142 NaN
7416 20573722000142 NaN
7417 20573722000142 NaN
7418 20573722000142 NaN
7419 20592604000181 NaN
7420 20573722000142 NaN
7421 15102288000182 NaN
7422 33131541000108 NaN
7423 20575279000149 NaN
7424 20575492000150 NaN
nome.loc[:, ['cpf_cnpj_doador_originario', 'cnpj_raiz_doador_originario']]
cpf_cnpj_doador_originario cnpj_raiz_doador_originario
7390 17262213000194 17262213
7391 90400888000142 90400888
7392 16639904000100 16639904
7393 00447821000170 00447821
7394 #NULO #NULO
7395 #NULO #NULO
7396 #NULO #NULO
7397 38105195100 38105195
7398 #NULO #NULO
7399 #NULO #NULO
7400 #NULO #NULO
7401 #NULO #NULO
7402 #NULO #NULO
7403 #NULO #NULO
7404 61186888000193 61186888
7405 15102288000182 15102288
7406 92693118000160 92693118
7407 92693118000160 92693118
7408 02125403000192 02125403
7409 33000092000169 33000092
7410 07052569000140 07052569
7411 #NULO #NULO
7412 #NULO #NULO
7413 #NULO #NULO
7414 #NULO #NULO
7415 03349915000103 03349915
7416 17463456000190 17463456
7417 71077747000196 71077747
7418 03349915000103 03349915
7419 04899037000154 04899037
7420 06142647000134 06142647
7421 #NULO #NULO
7422 #NULO #NULO
7423 04641376000136 04641376
7424 08250286634 08250286
Upvotes: 0
Views: 458
Reputation: 909
You can use the pandas.DataFrame.dropna method to remove the rows that contain NaN values:
eleitos_d_doadores.dropna(subset=['cnpj_raiz_doador'], how='all', inplace=True)
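Note that dropna discards the affected rows rather than recovering their values. If the goal is to keep every row, a possible alternative is to convert the column to strings before slicing. This is a minimal sketch, assuming the NaN results come from non-string (integer) elements in the object column; df is a hypothetical stand-in built from values in the question:

```python
import pandas as pd

# Hypothetical stand-in for eleitos_d_doadores. ASSUMPTION: the column
# mixes str and int elements, as the sample output suggests.
df = pd.DataFrame(
    {"cpf_cnpj_doador": ["01888360712", 50844182000155, "00697656691", 16760808649]}
)

# Convert every element to str first so that .str[:8] applies to all rows.
df["cnpj_raiz_doador"] = df["cpf_cnpj_doador"].astype(str).str[:8]
print(df)
```

One caveat: an integer that lost its leading zero when the data was loaded (for example 1421697000137, which has 13 digits) would yield "14216970" instead of the root with the zero restored. If the data came from a CSV file, reading the column as text, for instance with the dtype argument of pandas.read_csv, should preserve the zeros from the start.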
Upvotes: 1