Reputation: 47
I am trying to convert a list of strings to their ASCII codes and place each character's code in its own column of a dataframe. I have 30M such strings and I am running into memory issues with the code I'm running.
For example:
strings = ['a','asd',1234,'ewq']
would like to get the following dataframe:
0 1 2 3
0 97 0.0 0.0 0.0
1 97 115.0 100.0 0.0
2 49 50.0 51.0 52.0
3 101 119.0 113.0 0.0
What I have tried:
pd.DataFrame([[ord(chr) for chr in list(str(rec))] for rec in strings]).fillna(0)
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 404, in to_arrays
dtype=dtype)
File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 434, in _list_to_arrays
content = list(lib.to_object_array(data).T)
File "pandas/_libs/lib.pyx", line 2269, in pandas._libs.lib.to_object_array
MemoryError
Not sure if relevant, but strings is actually a column from another dataframe, taken with .values.
Also, the longest string is almost 255 characters long. I know 30M x 1000 is a big number. Any way I can get around this issue?
Upvotes: 0
Views: 957
Reputation: 4378
This uses the pandas sparse data type (which stores only the non-fill values, compressing away the zeros), but I only figured out how to apply it to the whole dataframe after it is built. NOTE: I assumed all strings are actually strings, not a mix of integers and strings.
import pandas as pd
import numpy as np
strings = ['a','asd','1234','ewq']
stringsSeries = pd.Series(strings)
# Find padding length
maxlen = max(len(s) for s in strings)
# Use 8 bit integer with pandas sparse data type, compressing zeros
dt = pd.SparseDtype(np.int8, 0)
# Create the sparse dataframe from a pandas Series for each integer ord value, padded with zeros
# NOTE: This compresses the dataframe after creation. I couldn't find the right way to compress
# each series as the dataframe is built
sdf = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen,chr(0))))).astype(dt)
sdf.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 4 columns):
# 0 4 non-null Sparse[int8, 0]
# 1 4 non-null Sparse[int8, 0]
# 2 4 non-null Sparse[int8, 0]
# 3 4 non-null Sparse[int8, 0]
# dtypes: Sparse[int8, 0](4)
# memory usage: 135.0 bytes
# The original uncompressed size
df = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen,chr(0)))))
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 4 columns):
# 0 4 non-null int64
# 1 4 non-null int64
# 2 4 non-null int64
# 3 4 non-null int64
# dtypes: int64(4)
# memory usage: 208.0 bytes
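As an alternative sketch (an assumption of mine, not part of the answer above): if the strings are pure ASCII, the matrix can be built directly as a dense NumPy uint8 array, skipping the per-row pd.Series objects entirely. At 30M rows x 255 columns that is roughly 7.6 GB, one byte per cell.

```python
import numpy as np
import pandas as pd

strings = ['a', 'asd', '1234', 'ewq']
maxlen = max(len(s) for s in strings)

# Pre-allocate one dense uint8 matrix; untouched slots stay 0,
# matching the chr(0) padding used above.
arr = np.zeros((len(strings), maxlen), dtype=np.uint8)
for i, s in enumerate(strings):
    b = s.encode('ascii')  # assumes pure-ASCII input
    arr[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)

df = pd.DataFrame(arr)  # wraps the uint8 matrix without upcasting
```

The same `astype(dt)` sparse conversion could then be applied to this frame if most cells really are zero.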
Upvotes: 0
Reputation: 4521
Have you tried setting the datatype explicitly to uint8 and then processing the data in chunks? From your example code, I guess you are implicitly using float64 (the dtype fillna produces for columns containing NaN), which requires 8 times more memory.
E.g. if you write it to a csv file and your strings fit into memory, you could try the following code:
def prepare_list(string, n, default):
    size = len(string)
    res = [ord(char) for char in string[:n]]
    if size < n:
        res += [default] * (n - size)
    return res
chunk_size = 10000   # number of strings to be processed per step
max_len = 4          # maximum number of columns (=characters per string)
column_names = [str(i+1) for i in range(max_len)]  # column names used for the columns
string_list = list(strings)  # your 30M strings
with open('output.csv', 'wt') as fp:
    while string_list:
        df = pd.DataFrame([prepare_list(s, max_len, 0) for s in string_list[:chunk_size]],
                          dtype='uint8', columns=column_names)
        df.to_csv(fp, header=fp.tell() == 0, index=False)
        string_list = string_list[chunk_size:]
When you read the csv created like this, you need to take care to set the type to uint8 again to avoid the same problem, and to make sure the file is read without turning the first column into an index. E.g. like this:
pd.read_csv('output.csv', dtype='uint8', index_col=False)
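If even the rebuilt frame is too big to hold at once, read_csv's chunksize argument streams it back in pieces. A minimal sketch, using an in-memory csv as a stand-in for the output.csv written by the loop above:

```python
import io
import pandas as pd

# In-memory stand-in for output.csv (header row, then two data rows).
csv_data = io.StringIO("1,2,3,4\n97,0,0,0\n49,50,51,52\n")

# chunksize makes read_csv yield small uint8 frames instead of one big one;
# each chunk can be processed and discarded before the next is read.
chunks = [chunk for chunk in pd.read_csv(csv_data, dtype='uint8', chunksize=1)]
df = pd.concat(chunks, ignore_index=True)
```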
Upvotes: 1