Bug 1319026 introduced logs to try and nail down what kinds of failures users experience when trying to send Telemetry pings. Let’s see what we’ve managed to collect.

import ujson as json
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.plotly as py

from plotly.graph_objs import *
from moztelemetry import get_pings_properties, get_one_ping_per_client
from moztelemetry.dataset import Dataset

%matplotlib inline
Unable to parse whitelist (/mnt/anaconda2/lib/python2.7/site-packages/moztelemetry/histogram-whitelists.json). Assuming all histograms are acceptable.
pings = Dataset.from_source("telemetry") \
    .where(docType='main') \
    .where(appUpdateChannel='nightly') \
    .where(submissionDate=lambda x: x >= "20170429") \
    .where(appBuildId=lambda x: x >= '20170429') \
    .records(sc, sample=1)
subset = get_pings_properties(pings, ["clientId",
log_entries = subset\
    .flatMap(lambda p: [] if p['payload/log'] is None else [l for l in p['payload/log'] if l[0] == 'TELEMETRY_SEND_FAILURE'])
log_entries = log_entries.cache()
error_counts = l: (tuple(l[2:]), 1)).countByKey()
entries_count = log_entries.count()
sorted(map(lambda i: ('{:.2%}'.format(1.0 * i[-1] / entries_count), i), error_counts.iteritems()), key=lambda x: x[1][1], reverse=True)
[('72.16%', ((u'errorhandler', u'error'), 530178)),
 ('27.04%', ((u'errorhandler', u'timeout'), 198698)),
 ('0.73%', ((u'5xx failure', u'504'), 5327)),
 ('0.07%', ((u'errorhandler', u'abort'), 530)),
 ('0.00%', ((u"4xx 'failure'", u'403'), 7)),
 ('0.00%', ((u'5xx failure', u'502'), 3))]


Alrighty, looks like we’re mostly “error”. Not too helpful, but does narrow things down a bit.

“timeout” is the reason for more than one in every four failures. That’s a smaller cohort than I’d originally thought.

A few Gateway Timeouts (504) which could be server load, very few aborts, and essentially no Forbidden (403) or Bad Gateway (502).