Will

Reputation: 737

Python Multiprocessing & ctypes arrays

I'm trying to do some work on a file. The file holds various data, which I pull in as raw strings and then work on.

I'm trying to split the work across multiple processes so I can work on several chunks at once, but the files are quite large (several gigabytes), so memory is an issue.

The processes don't need to modify the input data, so they don't need their own copies. However, I don't know how to make an array of strings as a ctype in Python 2.7.

Currently I have:

import multiprocessing, ctypes
from multiprocessing.sharedctypes import Value, Array

with open('test.txt', 'r') as fin:
    rawdata = Array('c', fin.readlines(), lock=False)

But this doesn't work as I'd hoped: it sees the whole thing as one massive char buffer and fails because it wants a single flat string rather than a list of lines. I need to be able to pull out the original lines and work on them with existing Python code that examines their contents, doing operations that range from substring matching to extracting integer and float values for mathematical use. Is there a sensible way to achieve this that I'm missing? Perhaps Array is the wrong tool for pushing the data into a shared C format?
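For reference, a minimal sketch (not from the original post) of the mismatch: Array('c', ...) accepts one flat string, whose characters become the elements, but each item of a list initializer is assigned to a single c_char, so a list of multi-character lines raises TypeError.

from multiprocessing.sharedctypes import Array

ok = Array('c', 'abc\ndef\n', lock=False)         # works: 8 chars, one per element
bad = Array('c', ['abc\n', 'def\n'], lock=False)  # TypeError: each element must be a single character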

Upvotes: 0

Views: 410

Answers (1)

Nate S.

Reputation: 126

Do you want your strings to end up as Python strings, or as C-style strings, i.e. null-terminated character arrays? If you're doing Python string processing, then simply reading the file into a regular (non-ctypes) Python string and using that everywhere is the way to go; Python doesn't copy strings by default, since they're immutable anyway. If you want to use C-style strings, then allocate a character buffer using ctypes and fill it with fin.readinto(buffer).
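To make that concrete, here is a minimal sketch of the first approach, assuming Python 2.7 on a fork-based platform (Linux/macOS); the file name and the per-line work are placeholders:

import multiprocessing

# Load the file once in the parent process. Workers forked after
# this point inherit the list copy-on-write, so the gigabytes are
# not duplicated as long as the workers only read the data.
with open('test.txt', 'r') as fin:
    lines = fin.readlines()

def work(i):
    # Placeholder for the real per-line processing: ordinary
    # Python string operations on the inherited, read-only data.
    return len(lines[i].split())

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    results = pool.map(work, range(len(lines)))
    pool.close()
    pool.join()

And a sketch of the second, C-style approach, filling one flat ctypes buffer in place with readinto() and splitting the lines back out yourself:

import ctypes, os

size = os.path.getsize('test.txt')
buf = ctypes.create_string_buffer(size)  # one zeroed char buffer of the file's size
with open('test.txt', 'rb') as fin:
    fin.readinto(buf)                    # fill it in place, no intermediate copy

lines = buf.raw.split('\n')              # recover the original lines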

Upvotes: 1
