I'm trying to convert a UInt8 pandas Series into the new StringDtype.
I can do the following, covered in this question, which predates the new string dtype:
import pandas as pd
int_series = pd.Series(range(20), dtype="UInt8")
obj_series = int_series.apply(str)
This gives me a Series of object dtype containing strings.
But if I try to convert the series to the new string dtype, I get an error:
>>> string_series = int_series.astype("string")
...
TypeError: data type not understood
Note that first converting the series to object dtype and then to string dtype works:
int_series.apply(str).astype("string")
How can I convert the int series to string directly?
I'm using pandas version 1.0.3 on Python 3.7.6.
Update: I've found this open issue on the pandas GitHub page that describes the exact same problem.
A comment in that issue points to another open issue, which covers the desired but currently unavailable functionality of converting between different ExtensionArray types.
So the answer is that the direct conversion cannot be done at the moment, but it will likely be possible in the future.
This is explained in the docs, in the example section:
Unlike object dtype arrays, StringArray doesn’t allow non-string values
The docs then show the following example:
>>> pd.array(['1', 1], dtype="string")
Traceback (most recent call last):
...
ValueError: StringArray requires an object-dtype ndarray of strings.
The only solution seems to be casting to object dtype first, as you were doing, and then to string dtype.
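As a minimal sketch of that two-step cast, the helper below goes through an object-dtype intermediate and stringifies only the non-missing values. The name to_string_dtype is hypothetical (it is not a pandas API), and the missing-value handling via na_action="ignore" is an addition that the question's apply(str) does not have:

```python
import pandas as pd


def to_string_dtype(series: pd.Series) -> pd.Series:
    """Convert a Series to the "string" dtype via an object-dtype intermediate.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Step 1: cast to object dtype so the values are plain Python objects
    # (missing entries stay as missing-value markers such as pd.NA).
    as_object = series.astype(object)
    # Step 2: stringify only the non-missing elements; na_action="ignore"
    # skips missing values instead of turning them into the text "<NA>".
    as_text = as_object.map(str, na_action="ignore")
    # Step 3: the data now holds only strings or missing values, which is
    # what the string dtype accepts.
    return as_text.astype("string")


int_series = pd.Series([0, 1, None, 3], dtype="UInt8")
string_series = to_string_dtype(int_series)
print(string_series.tolist())
```

With plain apply(str), a missing entry would become the literal text "<NA>"; here it should stay missing after the conversion.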
This is also clearly stated in the source code of StringArray, where right at the top you'll see the warning:
.. warning::

   Currently, this expects an object-dtype ndarray
   where the elements are Python strings or :attr:`pandas.NA`.
   This may change without warning in the future. Use
   :meth:`pandas.array` with ``dtype="string"`` for a stable way of
   creating a `StringArray` from any sequence.
If you check the validation step in _validate, you'll see how this will fail for arrays of non-strings:
def _validate(self):
    """Validate that we only store NA or strings."""
    if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
        raise ValueError("StringArray requires a sequence of strings or pandas.NA")
    if self._ndarray.dtype != "object":
        raise ValueError(
            "StringArray requires a sequence of strings or pandas.NA. Got "
            f"'{self._ndarray.dtype}' dtype instead."
        )
For the example in the question:
import numpy as np
from pandas._libs import lib
lib.is_string_array(np.array(range(20)), skipna=True)
# False
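Since pandas._libs is a private module, a similar check can be sketched with the public pandas.api.types.infer_dtype function instead. This is a swapped-in public analogue, not the check StringArray itself runs, and looks_like_strings is a hypothetical helper name:

```python
import numpy as np
from pandas.api.types import infer_dtype


def looks_like_strings(values) -> bool:
    """Hypothetical helper: True if pandas infers the values as strings."""
    # infer_dtype returns a label such as "string" or "integer";
    # skipna=True ignores missing values while inferring.
    return infer_dtype(values, skipna=True) == "string"


# Integer ndarray, as in the question.
int_values = np.array(range(20))
# Object ndarray of Python strings, as produced by the object-dtype detour.
str_values = np.array([str(i) for i in range(20)], dtype=object)
# Mixed strings and integers, as in the docs example.
mixed_values = np.array(["1", 1], dtype=object)

for name, values in [("int", int_values), ("str", str_values), ("mixed", mixed_values)]:
    print(name, infer_dtype(values, skipna=True), looks_like_strings(values))
```

Only the object array of strings should pass, which matches why the detour through object dtype is needed before casting to string dtype.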