Reputation: 47
I am trying to convert a list of strings to their ASCII codes and place each character's code in its own column of a dataframe. I have 30M such strings and I am running into memory issues with the code I'm running.
For example:
strings = ['a','asd',1234,'ewq']
would like to get the following dataframe:
0 1 2 3
0 97 0.0 0.0 0.0
1 97 115.0 100.0 0.0
2 49 50.0 51.0 52.0
3 101 119.0 113.0 0.0
What I have tried:
pd.DataFrame([[ord(chr) for chr in list(str(rec))] for rec in strings]).fillna(0)
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 404, in to_arrays
dtype=dtype)
File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 434, in _list_to_arrays
content = list(lib.to_object_array(data).T)
File "pandas/_libs/lib.pyx", line 2269, in pandas._libs.lib.to_object_array
MemoryError
Not sure if relevant, but strings is actually a column from another dataframe, taken with .values.
Also, the longest string is almost 255 characters long. I know 30M x 1000 is a big number. Any way I can get around this issue?
Upvotes: 0
Views: 957
Reputation: 4378
This uses the pandas sparse data type (which stores only the non-fill values, compressing away the zeros), but I only figured out how to apply it to the whole dataframe after it is built. NOTE: I assumed all strings are actually strings, not a mix of integers and strings.
import pandas as pd
import numpy as np
strings = ['a','asd','1234','ewq']
stringsSeries = pd.Series(strings)
# Find padding length
maxlen = max(len(s) for s in strings)
# Use 8 bit integer with pandas sparse data type, compressing zeros
dt = pd.SparseDtype(np.int8, 0)
# Create the sparse dataframe from a pandas Series for each integer ord value, padded with zeros
# NOTE: This compresses the dataframe after creation. I couldn't find the right way to compress
# each series as the dataframe is built
sdf = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen,chr(0))))).astype(dt)
sdf.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 4 columns):
# 0 4 non-null Sparse[int8, 0]
# 1 4 non-null Sparse[int8, 0]
# 2 4 non-null Sparse[int8, 0]
# 3 4 non-null Sparse[int8, 0]
# dtypes: Sparse[int8, 0](4)
# memory usage: 135.0 bytes
# The original uncompressed size
df = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen,chr(0)))))
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 4 columns):
# 0 4 non-null int64
# 1 4 non-null int64
# 2 4 non-null int64
# 3 4 non-null int64
# dtypes: int64(4)
# memory usage: 208.0 bytes
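As an alternative sketch (an assumption of mine, not part of the answer above): if the strings are pure ASCII, the matrix can be built directly as a dense NumPy uint8 array, skipping the per-row pd.Series objects entirely. At 30M rows x 255 columns that is roughly 7.6 GB, one byte per cell.

```python
import numpy as np
import pandas as pd

strings = ['a', 'asd', '1234', 'ewq']
maxlen = max(len(s) for s in strings)

# Pre-allocate one dense uint8 matrix; untouched slots stay 0,
# matching the chr(0) padding used above.
arr = np.zeros((len(strings), maxlen), dtype=np.uint8)
for i, s in enumerate(strings):
    b = s.encode('ascii')  # assumes pure-ASCII input
    arr[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)

df = pd.DataFrame(arr)  # wraps the uint8 matrix without upcasting
```

The same `astype(dt)` sparse conversion could then be applied to this frame if most cells really are zero.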
Upvotes: 0
Reputation: 4521
Have you tried setting the datatype explicitly to uint8 and then processing the data in chunks? From your example code, I guess you are implicitly using float64 (the dtype fillna produces for columns containing NaN), which requires 8 times more memory.
E.g. if you write it to a csv file and your strings fit into memory, you could try the following code:
def prepare_list(string, n, default):
    size = len(string)
    res = [ord(char) for char in string[:n]]
    if size < n:
        res += [default] * (n - size)
    return res
chunk_size = 10000   # number of strings to be processed per step
max_len = 4          # maximum number of columns (=characters per string)
column_names = [str(i+1) for i in range(max_len)]  # column names used for the columns
string_list = list(strings)  # your 30M strings
with open('output.csv', 'wt') as fp:
    while string_list:
        df = pd.DataFrame([prepare_list(s, max_len, 0) for s in string_list[:chunk_size]],
                          dtype='uint8', columns=column_names)
        df.to_csv(fp, header=fp.tell() == 0, index=False)
        string_list = string_list[chunk_size:]
When you read the csv created like this, you need to take care to set the type to uint8 again to avoid the same problem, and to make sure the file is read without turning the first column into an index. E.g. like this:
pd.read_csv('output.csv', dtype='uint8', index_col=False)
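If even the rebuilt frame is too big to hold at once, read_csv's chunksize argument streams it back in pieces. A minimal sketch, using an in-memory csv as a stand-in for the output.csv written by the loop above:

```python
import io
import pandas as pd

# In-memory stand-in for output.csv (header row, then two data rows).
csv_data = io.StringIO("1,2,3,4\n97,0,0,0\n49,50,51,52\n")

# chunksize makes read_csv yield small uint8 frames instead of one big one;
# each chunk can be processed and discarded before the next is read.
chunks = [chunk for chunk in pd.read_csv(csv_data, dtype='uint8', chunksize=1)]
df = pd.concat(chunks, ignore_index=True)
```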
Upvotes: 1