user281989

Reputation: 47

Given a column with string data, create a DataFrame with the ASCII equivalent of each character in the string

I am trying to convert a list of strings to their ASCII codes and place each character's value in its own column of a DataFrame. I have 30M such strings and I am running into memory issues with the code I'm running.

For example: strings = ['a','asd',1234,'ewq']

would like to get the following dataframe:

     0      1      2     3
0   97    0.0    0.0   0.0
1   97  115.0  100.0   0.0
2   49   50.0   51.0  52.0
3  101  119.0  113.0   0.0

What I have tried: pd.DataFrame([[ord(chr) for chr in list(str(rec))] for rec in strings]).fillna(0)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 404, in to_arrays
    dtype=dtype)
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 434, in _list_to_arrays
    content = list(lib.to_object_array(data).T)
  File "pandas/_libs/lib.pyx", line 2269, in pandas._libs.lib.to_object_array
MemoryError

Not sure if it's relevant, but strings is actually a column taken from another DataFrame with .values.

Also, the longest string is almost 255 characters long, so I know a 30M × 255 table is big. Is there any way I can get around this issue?

Upvotes: 0

Views: 957

Answers (2)

Andrew Lavers

Reputation: 4378

This uses the pandas sparse data type, but I only figured out how to apply it to the whole DataFrame after it is built. NOTE: I assumed all entries are strings, not a mix of integers and strings.

import pandas as pd
import numpy as np

strings = ['a','asd','1234','ewq']
stringsSeries = pd.Series(strings)

# Find padding length
maxlen = max(len(s) for s in strings)

# Use 8-bit integers with the pandas sparse data type, compressing away the zero padding
dt = pd.SparseDtype(np.int8, 0)

# Create the sparse dataframe from a pandas Series for each integer ord value, padded with zeros
# NOTE: This compresses the dataframe after creation. I couldn't find the right way to compress
# each series as the dataframe is built
sdf = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen, chr(0))))).astype(dt)
print(f"Memory used: {sdf.info()}")

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 4 columns):
# 0    4 non-null Sparse[int8, 0]
# 1    4 non-null Sparse[int8, 0]
# 2    4 non-null Sparse[int8, 0]
# 3    4 non-null Sparse[int8, 0]
# dtypes: Sparse[int8, 0](4)
# memory usage: 135.0 bytes
# Memory used: None

# The original uncompressed size
df = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen, chr(0)))))
print(f"Memory used: {df.info()}")

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 4 columns):
# 0    4 non-null int64
# 1    4 non-null int64
# 2    4 non-null int64
# 3    4 non-null int64
# dtypes: int64(4)
# memory usage: 208.0 bytes
# Memory used: None
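
Following up on the note above: if you want each column to be sparse from the start rather than compressed after the fact, one option might be to build pd.arrays.SparseArray columns directly. This is only a sketch, under the same all-ASCII assumption (int8 can only hold code points up to 127):

# Sketch: build each column as a SparseArray up front, avoiding a dense intermediate
cols = {
    i: pd.arrays.SparseArray(
        np.array([ord(s[i]) if i < len(s) else 0 for s in strings], dtype=np.int8),
        fill_value=0,
    )
    for i in range(maxlen)
}
sdf2 = pd.DataFrame(cols)  # every column is Sparse[int8, 0] from the start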

Upvotes: 0

jottbe

Reputation: 4521

Have you tried setting the datatype explicitly to uint8 and then processing the data in chunks? From your example code, I guess you are implicitly ending up with float64 (the pandas default once fillna/NaN is involved), which requires 8 times more memory than uint8.

E.g. if you write it to a csv file and your strings fit into memory, you could try the following code:

def prepare_list(string, n, default):
    size = len(string)
    res = [ord(char) for char in string[:n]]
    if size < n:
        res += [default] * (n - size)
    return res

chunk_size = 10000  # number of strings to be processed per step
max_len = 4         # maximum number of columns (= characters per string)
column_names = [str(i+1) for i in range(max_len)]  # column names used for the columns
string_list = list(strings)  # your list of strings (copied, because the loop below consumes it)
with open('output.csv', 'wt') as fp:
    while string_list:
        df = pd.DataFrame([prepare_list(s, max_len, 0) for s in string_list[:chunk_size]], dtype='uint8', columns=column_names)
        df.to_csv(fp, header=fp.tell() == 0, index=False)
        string_list = string_list[chunk_size:]

When you read the csv created like this, you need to take care that you set the type to uint8 again to avoid the same problem, and make sure the file is read without turning the first column into an index. E.g. like this:

pd.read_csv('output.csv', dtype='uint8', index_col=False)
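
If even the converted file is too large to load back at once, read_csv can also stream it in chunks via its chunksize parameter. A minimal sketch (process_chunk is a hypothetical per-chunk handler, not part of the code above):

for chunk in pd.read_csv('output.csv', dtype='uint8', chunksize=100000):
    process_chunk(chunk)  # hypothetical handler applied to each uint8 chunk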

Upvotes: 1
