ndeslandes

Reputation: 195

Process hangs while trying to push Azure AppInsights metric

I'm using the Application Insights Python API to publish a custom metric for my application every 30 s. This works fine for a while (up to several days), but then my Python script just hangs while trying to flush the data to Azure.

The Python code itself is fairly simple; it is just this infinite loop:

while True:
    count = get_connection_count()
    if count is not None:
        tc.track_metric("ConnectionCount", count, type=DataPointType.measurement, count=1)
        tc.flush()
    time.sleep(10)

A stack trace (below) shows the process is stuck on tc.flush(), waiting for an answer from the server.

If I look at the TCP connections for the process, I can see it still has an open TCP connection to Azure; it's just not getting any reply. Has anyone encountered a similar issue? What would cause Azure AppInsights to stop responding like this?

Alternatively, can a timeout be defined for the tc.flush call, so I can at least recover from an unresponsive endpoint?

Here's the stack trace I was able to extract:

  File "/var/lib/app-monitor/connectionMonitor.py", line 52, in <module>
        tc.flush()
  File "/usr/local/lib/python2.7/dist-packages/applicationinsights/TelemetryClient.py", line 55, in flush
        self._channel.flush()
  File "/usr/local/lib/python2.7/dist-packages/applicationinsights/channel/TelemetryChannel.py", line 71, in flush
        self._queue.flush()
  File "/usr/local/lib/python2.7/dist-packages/applicationinsights/channel/SynchronousQueue.py", line 39, in flush
        local_sender.send(data)
  File "/usr/local/lib/python2.7/dist-packages/applicationinsights/channel/SenderBase.py", line 118, in send
        response = HTTPClient.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
        return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
        response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
        '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
        result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1222, in https_open
        return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
        r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
        response.begin()
  File "/usr/lib/python2.7/httplib.py", line 444, in begin
        version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status
        line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
        data = self._sock.recv(self._rbufsize)
  File "/usr/lib/python2.7/ssl.py", line 341, in recv
        return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 260, in read
        return self._sslobj.read(len)

Upvotes: 0

Views: 1156

Answers (2)

John Gardner

Reputation: 25126

After some discussion internally, there's a workaround, though not really a fix: make sure that sockets have some kind of default timeout value to prevent them from hanging forever:

import socket
socket.setdefaulttimeout(30)

Note that this applies to every HTTP call the script makes (the default covers all newly created sockets), so it isn't necessarily ideal, but it does prevent things from hanging indefinitely.
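With a default timeout in place, a stalled read raises socket.timeout instead of blocking forever, so the loop needs to be ready for that. Here is a minimal sketch of one way to handle it. `safe_flush` is a hypothetical helper, not part of the Application Insights SDK, and whether the exception actually propagates out of tc.flush() depends on the SDK version, so treat that as an assumption:

```python
import logging
import socket

# Workaround from above: give every new socket a 30 s timeout.
socket.setdefaulttimeout(30)


def safe_flush(flush):
    """Call flush() and return True on success, False on a network error.

    `flush` is any zero-argument callable, e.g. tc.flush (hypothetical usage).
    socket.timeout is a subclass of IOError on Python 2.7 and of OSError
    (the same class as IOError) on Python 3, so one except clause covers both.
    """
    try:
        flush()
        return True
    except IOError as exc:
        logging.warning("flush failed, will retry on next cycle: %s", exc)
        return False
```

In the loop from the question, this would replace the bare tc.flush() with safe_flush(tc.flush), so one unresponsive request costs at most 30 seconds rather than hanging the process.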

Upvotes: 0

Peter Pan

Reputation: 24138

In my experience, there are two possible causes for this issue.

  1. Your application may have exceeded some limits on the number of metrics and events. Please refer to the official documentation, and capture the response status code with Wireshark or Fiddler on Linux to check. The error codes for this case include 402 (Payment required), 429 (Too many requests), 503 (Service unavailable), etc.

  2. The service itself may be undergoing planned maintenance or issue resolution. You can always check the health and status of Application Insights at http://aka.ms/aistatus.
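If you do capture a status code, a small lookup can make the log output easier to read. This is only a sketch: `describe_status` and `LIMIT_STATUS_CODES` are hypothetical names, and the table contains just the three codes mentioned in this answer, not a complete list from the service documentation:

```python
# Status codes named in this answer as signs of exceeded limits.
LIMIT_STATUS_CODES = {
    402: "Payment required",
    429: "Too many requests",
    503: "Service unavailable",
}


def describe_status(code):
    """Return a short hint for an HTTP status code captured from the endpoint."""
    reason = LIMIT_STATUS_CODES.get(code)
    if reason is not None:
        return "%d (%s): possibly a limit or availability problem" % (code, reason)
    return "%d: not one of the codes listed here" % code
```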

Hope it helps.

Upvotes: 0
