Reputation: 493
I'm reading lists from a large file, which I eventually want to store as array.array
s. Because
map(int, line.split())
is very slow, I wrote a small C module which does strtok and a faster version of atoi:
inline long
minhashTables_myatoi(const char* s)
{
int r;
for (r = 0; *s; r = r * 10 + *s++ - '0');
return r;
}
static PyObject*
minhashTables_ints(PyObject *self, PyObject *args)
{
char* s;
Py_ssize_t slen;
if(!PyArg_ParseTuple(args, "s#", &s, &slen))
return NULL;
long* buf = malloc(sizeof(long) * (slen+1)/2);
const char* tok = strtok(s, " ");
buf[0] = minhashTables_myatoi(tok);
Py_ssize_t i;
for(i = 1; (tok = strtok(NULL, " ")) != NULL; i++)
buf[i] = minhashTables_myatoi(tok);
Py_ssize_t buflen = i;
PyObject* list = PyList_New(buflen);
PyObject *o;
for(i = 0; i < buflen; i++)
{
o = PyInt_FromLong(buf[i]);
PyList_SET_ITEM(list, i, o);
}
free(buf);
return list;
}
So my python script calls ints()
with a string and passes it to the array.array
constructor and saves the resulting array in a list
.
My problem is, that now the script leaks memory, which it did not with the map instead of the ints()
function, of course.
Also using my own version of Pythons int()
using a C module does not leak memory.
Thanks for your help!
Edit: To valgrind the module I used this script:
import minhashTables
data = ' '.join(map(str, range(10)))
print 'start'
foo = minhashTables.ints(data)
del data
del foo
print 'stop'
And I run valgrind --tool=memcheck --leak-check=full --show-reachable=yes python test.py
, but there so no output from valgrind between start
and stop
, through there are tons before and afterwards.
Edit: Code for confirming it's leaking: import minhashTables
for i in xrange(1000000000):
data = ' '.join(map(str, range(10, 10000)))
foo = minhashTables.ints(data)
I have to recreate the string, because strtok changes it. By the way, copying the string into another memory location doesn't change the behavior.
Upvotes: 2
Views: 508
Reputation: 1
Try this
inline long
minhashTables_myatoi(const char* s)
{
long result=0;
while((*s)!='\0'){
result = result * 10 + (*s- '0');
s++;
}
return result;
}
Upvotes: -1
Reputation: 10350
Do you really need to malloc
off space for all those long
s?
I'm not familiar with the Python/C API, so this might be terrible advice, but can't you just iterate over the string and append each long you find onto the list as you go?
i.e. take this code:
static const char* const testString = "12 345 67 8 910 11 1213 141516, 1718";
int main()
{
const char* i = testString;
long parseLong = 0;
int gotLong = 0;
for (;*i;++i)
{
if ('0' <= *i && *i <= '9')
{
parseLong = (parseLong * 10) + (*i - '0');
gotLong = 1;
}
else if (gotLong)
{
printf("Got: %d\n", parseLong);
parseLong = 0;
gotLong = 0;
}
}
if (gotLong)
printf("Got: %d\n", parseLong);
}
And then replace the printf
with some suitable pythony-goodness like PyList_Append()
.
As well as avoiding malloc
, using less memory and being able to safely operate directly on the constant Python string, this code also handles corner cases like empty strings, multiple spaces and other separators between numbers.
for (i = testString;*i;++i)
{
const int isdigitoflong = isdigit(*i);
if (!gotLong && isdigitoflong)
longCount++;
gotLong = isdigitoflong;
}
Which should be relatively quick.
gotLong
and doesn't have to repeat code to deal with the final long:
for (i = testString;*i;++i)
{
if (isdigit(*i))
{
do {
parseLong = (parseLong * 10) + (*i - '0');
} while (*++i && isdigit(*i));
printf("Got: %d\n", parseLong);
parseLong = 0;
}
}
Upvotes: 1