Reputation: 3
I'm trying to convert a pandas column from object to string in Python 3. The code I'm using doesn't raise an error, but it also doesn't change the dtype.
import pandas as pd
import numpy as np
import nltk
import os
nltk.download('punkt')
import nltk.corpus
import re
#Read in fields
jan = pd.read_excel(r'C:\Users\Sabrina\JIRA\2019\2019_jan.xls')
#Indicate columns for performing tokenization
jan_a = pd.DataFrame(jan, columns= ['Summary'])
#Tokenize columns for text analysis
jan_a['Summary'] = jan_a.apply(lambda column:
    nltk.word_tokenize(column['Summary']), axis=1)
print(jan_a)
print(jan_a['Summary'].dtypes)
#Convert list to string
jan_a['Summary'].astype('str')
print(jan_a['Summary'].dtypes)
Both print statements report the dtype as object. Any assistance would be appreciated!
Upvotes: 0
Views: 554
Reputation: 8510
By default, pandas stores Python str values with the object dtype:
>>> import pandas as pd
>>> df = pd.DataFrame(["aa1 bb2 cc3".split(),"aa4 bb5 cc6".split()],columns="col1 col2 col3".split())
>>>
>>> df
  col1 col2 col3
0  aa1  bb2  cc3
1  aa4  bb5  cc6
>>> df["col1"]
0    aa1
1    aa4
Name: col1, dtype: object
>>>
You need to tell pandas explicitly that you want the string dtype, either at creation by passing dtype="string":
>>> df2 = pd.DataFrame(["aa1 bb2 cc3".split(),"aa4 bb5 cc6".split()],dtype="string",columns="col1 col2 col3".split())
>>> df2
  col1 col2 col3
0  aa1  bb2  cc3
1  aa4  bb5  cc6
>>> df2["col1"]
0    aa1
1    aa4
Name: col1, dtype: string
>>>
or by converting it later with astype, which returns a new Series rather than changing df in place:
>>> df["col1"].astype("string")
0    aa1
1    aa4
Name: col1, dtype: string
>>>
See the relevant part of the documentation for more detail: https://pandas.pydata.org/docs/user_guide/text.html#text-types
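As a rough sketch of why the line in the question appears to do nothing: astype hands back a converted copy, so the column only changes once that result is assigned back. The small frame below is a made-up stand-in (not the asker's data), and the variable names are only for illustration.

```python
import pandas as pd

# Made-up stand-in frame; the column name mirrors the example above.
df = pd.DataFrame({"col1": ["aa1", "aa4"]})

# astype returns a new Series; df itself is left untouched by this call.
converted = df["col1"].astype("string")
dtype_before_assign = df["col1"].dtype

# Assigning the result back is what actually replaces the column.
df["col1"] = converted
dtype_after_assign = df["col1"].dtype

print(dtype_before_assign)
print(dtype_after_assign)
```

The dtype printed first depends on the pandas version's default for text; the second should be the explicit string dtype requested by astype.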
Upvotes: 0
Reputation: 2804
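astype returns a new Series instead of modifying the column in place, so the result has to be assigned back. Try changing: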
jan_a['Summary'].astype('str')
to
jan_a['Summary'] = jan_a['Summary'].astype('string')
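One thing that may be worth checking: after tokenization each cell holds a list of tokens, not a str. If the goal is one plain string per row, a possible approach is to join the tokens first and then convert. This is only a sketch under that assumption; the sample data is invented, and joining with a single space is my choice, not something stated in the question.

```python
import pandas as pd

# Invented stand-in for the tokenized 'Summary' column: each cell is a list of tokens.
jan_a = pd.DataFrame({"Summary": [["Fix", "login", "bug"], ["Update", "docs"]]})

# Join each token list into one space-separated str (an assumption about the
# desired result), convert to the string dtype, and assign the result back.
jan_a["Summary"] = jan_a["Summary"].apply(" ".join).astype("string")

print(jan_a["Summary"].dtype)
print(jan_a["Summary"].tolist())
```

Joining with a space will not restore the original spacing around punctuation, so a different separator or a detokenizer may suit the data better.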
Upvotes: 1