Reputation: 2370
I noticed a very large increase in memory usage when retrieving a PDF file with the requests library. The file itself is ~4 MB, but the physical memory allocated to the Python process increases by more than 150 MB!
Is anyone aware of possible causes (and maybe fixes) for this behavior?
This is the test case:
import requests, gc

def dump_mem():
    s = open("/proc/self/status").readlines()
    for line in s:
        if line.startswith("VmRSS"):
            return line
Below is the output I got in the interpreter.
>>> gc.collect()
0
>>> dump_mem()
'VmRSS:\t 13772 kB\n'
>>> gc.collect()
0
>>> r = requests.get('http://www.ipd.uni-karlsruhe.de/~ovid/Seminare/DWSS05/Ausarbeitungen/Seminar-DWSS05')
>>> gc.collect()
5
>>> dump_mem()
'VmRSS:\t 20620 kB\n'
>>> r.headers['content-length']
'4089190'
>>> dump_mem()
'VmRSS:\t 20628 kB\n'
>>> gc.collect()
0
>>> c = r.content
>>> dump_mem()
'VmRSS:\t 20628 kB\n'
>>> gc.collect()
0
>>> t = r.text
>>> gc.collect()
8
>>> dump_mem()
'VmRSS:\t 182368 kB\n'
Obviously I shouldn't try to decode a PDF file as text. But what is the cause of this behavior anyway?
Upvotes: 1
Views: 8822
Reputation: 1121844
When no charset parameter is included in the Content-Type header and the response does not have a text/ mimetype, a character-detection library is used to determine the codec. By using response.text you triggered this detection, loading the library, and its modules include some sizable tables.
Depending on the exact version of requests you have installed, you'll find that sys.modules['requests.packages.chardet'] or sys.modules['requests.packages.charade'] is now present, together with around 36 submodules, where it wasn't before you used r.text.
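You can see this for yourself by inspecting sys.modules in a fresh interpreter, before and after the first access to r.text. A minimal stdlib-only sketch of the check (the exact package name varies with the requests version, as noted above):

```python
import sys

# List detector modules currently loaded. In a fresh session this is empty;
# after the first access to r.text it grows to roughly 36 entries.
detector_modules = sorted(
    name for name in sys.modules
    if 'chardet' in name or 'charade' in name
)
print(detector_modules)
```

Run it once before and once after touching r.text and compare the two lists.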
As the detection runs, a number of objects are created that apply various statistical techniques to your PDF document, since detection fails to hit on any specific codec with enough certainty. To fit all this in memory, Python requests more memory from your OS. Once the detection process is complete, that memory is freed again, but the OS does not de-allocate it immediately. This prevents wild memory churn, since processes often request and free memory in cycles.
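If you want to sidestep the detection cost entirely, read the raw bytes via r.content, or set r.encoding before touching r.text. A sketch using a hand-built Response object so it runs without a network connection (this pokes at the private _content attribute purely for offline illustration; the body and header are placeholders):

```python
import requests

# Hand-built Response standing in for a download served without a charset
resp = requests.models.Response()
resp._content = b'hello world'  # placeholder body (private attr, demo only)
resp.headers['Content-Type'] = 'application/octet-stream'

assert resp.encoding is None   # no charset declared -> .text would detect
data = resp.content            # raw bytes, no decoding, no detection
resp.encoding = 'utf-8'        # declare the codec up front
print(resp.text)               # decodes directly; chardet never runs
```

With a real download the pattern is the same: check r.encoding after the request, and assign it yourself if the server left it out.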
Note that you also added the result of r.text to your memory footprint, bound to t. This is a Unicode text object, which in Python 2 takes between 2 and 4 times as much memory as the bytestring. The specific download here is nearly 4 MB as a bytestring, but if you are using a UCS-4 Python build, the resulting Unicode value adds another 16 MB just for the decoded value.
Upvotes: 7