graphid.core.refresh module

class graphid.core.refresh.RefreshCriteria(window=20, patience=72, thresh=0.1, method='binomial')[source]

Bases: object

Determine when to re-query for candidate edges.

Models an upper bound on the probability that any of the next patience reviews will be label-changing (meaningful). Once this probability falls below a threshold, the criterion triggers. The model is either binomial or Poisson. Both behave about the same, though the binomial is a slightly better model.

It does this by maintaining an estimate of the probability that any particular review will be label-changing, using an exponentially weighted moving average. This estimate serves as the rate parameter (Poisson) or the individual event probability (binomial).
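As a rough illustration of the model described above, the following sketch updates an exponentially weighted moving average and converts it into the probability that any of the next patience reviews is meaningful. The helper names (ewma_update, prob_any_meaningful), the mapping alpha = 2 / (window + 1), and the starting value of mu are assumptions made for this example and are not necessarily what RefreshCriteria does internally.

```python
import math


def ewma_update(mu, meaningful, window=20):
    """Fold one review outcome (truthy = label-changing) into the average."""
    # Assumption: the window is treated as an EWMA span, alpha = 2 / (span + 1).
    alpha = 2.0 / (window + 1.0)
    return alpha * float(meaningful) + (1.0 - alpha) * mu


def prob_any_meaningful(mu, patience=72, method='binomial'):
    """Probability that at least one of the next `patience` reviews is meaningful."""
    if method == 'binomial':
        # Each review is an independent trial with success probability mu.
        prob_none = (1.0 - mu) ** patience
    elif method == 'poisson':
        # Events arrive at rate mu per review, so the expected count is mu * patience.
        prob_none = math.exp(-mu * patience)
    else:
        raise ValueError('unknown method: {!r}'.format(method))
    return 1.0 - prob_none


# Hypothetical review stream: a few meaningful reviews, then a long quiet run.
mu = 1.0  # assumed pessimistic starting estimate
for meaningful in [1, 0, 0, 1] + [0] * 100:
    mu = ewma_update(mu, meaningful)

# The criterion would trigger once the probability drops below the threshold.
triggered = prob_any_meaningful(mu) < 0.1
print(mu, prob_any_meaningful(mu), triggered)
```

For small mu the two methods nearly agree, because (1 - mu) ** patience is approximately exp(-mu * patience).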

clear()[source]
check()[source]
prob_any_remain(n_remain_edges=None)[source]
_prob_none_remain(n_remain_edges=None)[source]
pred_num_positives(n_remain_edges)[source]

Uses a Poisson process to estimate the number of remaining positive reviews.

Multiplying mu * n_remain_edges gives a probabilistic upper bound on the number of errors remaining. This only provides a real estimate if reviewing in a random order.
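The arithmetic behind this estimate can be sketched on synthetic data. The snippet below is only an illustration: it uses a plain sample mean in place of the class's moving average, and the label counts, the seed, and the function name estimate_remaining_positives are all hypothetical.

```python
import random


def estimate_remaining_positives(mu, n_remain_edges):
    """Expected number of positive reviews left, given a per-review rate mu."""
    return mu * n_remain_edges


# Hypothetical population: 50 positive edges hidden among 1000 candidates.
rng = random.Random(0)
labels = [1] * 50 + [0] * 950
rng.shuffle(labels)  # random review order, as the estimate requires

reviewed, remaining = labels[:200], labels[200:]

# Assumption: a simple sample mean stands in for the moving average.
mu = sum(reviewed) / len(reviewed)

n_pred = estimate_remaining_positives(mu, len(remaining))
n_real = sum(remaining)
print('pred', n_pred, 'real', n_real)
```

If the edges were instead reviewed in descending score order, the positives would be concentrated at the front, and the rate observed so far would no longer be representative of the edges that remain.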

Example

>>> # ENABLE_DOCTEST
>>> from graphid.core.refresh import *  # NOQA
>>> from graphid import demo
>>> infr = demo.demodata_infr(num_pccs=50, size=4, size_std=2)
>>> edges = list(infr.ranker.predict_candidate_edges(infr.aids, K=100))
>>> #edges = util.shuffle(sorted(edges), rng=321)
>>> scores = np.array(infr.verifier.predict_edges(edges))
>>> sortx = scores.argsort()[::-1]
>>> edges = list(ub.take(edges, sortx))
>>> scores = scores[sortx]
>>> ys = infr.match_state_df(edges)[POSTV].values
>>> y_remainsum = ys[::-1].cumsum()[::-1]
>>> refresh = RefreshCriteria(window=250)
>>> n_pred_list = []
>>> n_real_list = []
>>> xdata = []
>>> for count, (edge, y) in enumerate(zip(edges, ys)):
>>>     refresh.add(y, user_id='user:oracle')
>>>     n_remain_edges = len(edges) - count
>>>     n_pred = refresh.pred_num_positives(n_remain_edges)
>>>     n_real = y_remainsum[count]
>>>     if count == 2000:
>>>         break
>>>     n_real_list.append(n_real)
>>>     n_pred_list.append(n_pred)
>>>     xdata.append(count + 1)
>>> # xdoctest: +REQUIRES(--show)
>>> import plottool_ibeis as pt
>>> pt.qtensure()
>>> n_pred_list = n_pred_list[10:]
>>> n_real_list = n_real_list[10:]
>>> xdata = xdata[10:]
>>> pt.multi_plot(xdata, [n_pred_list, n_real_list], marker='',
>>>               label_list=['pred', 'real'], xlabel='review num',
>>>               ylabel='pred remaining merges')
>>> stop_point = xdata[np.where(y_remainsum[10:] == 0)[0][0]]
>>> pt.gca().plot([stop_point, stop_point], [0, int(max(n_pred_list))], 'g-')
add(meaningful, user_id, decision=None)[source]
ave(method='exp')[source]
>>> from graphid import demo
>>> infr = demo.demodata_infr(num_pccs=40, size=4, size_std=2, ignore_pair=True)
>>> edges = list(infr.ranker.predict_candidate_edges(infr.aids, K=100))
>>> scores = np.array(infr.verifier.predict_edges(edges))
>>> #sortx = util.shuffle(np.arange(len(edges)), rng=321)
>>> sortx = scores.argsort()[::-1]
>>> edges = list(ub.take(edges, sortx))
>>> scores = scores[sortx]
>>> ys = infr.match_state_df(edges)[POSTV].values
>>> y_remainsum = ys[::-1].cumsum()[::-1]
>>> refresh = RefreshCriteria(window=250)
>>> ma1 = []
>>> ma2 = []
>>> reals = []
>>> xdata = []
>>> for count, (edge, y) in enumerate(zip(edges, ys)):
>>>     refresh.add(y, user_id='user:oracle')
>>>     ma1.append(refresh._ewma)
>>>     ma2.append(refresh.pos_frac)
>>>     n_real = y_remainsum[count] / (len(edges) - count)
>>>     reals.append(n_real)
>>>     xdata.append(count + 1)
>>> # xdoctest: +REQUIRES(--show)
>>> from graphid import util
>>> util.qtensure()
>>> util.multi_plot(xdata, [ma1, ma2, reals], marker='',
>>>               label_list=['exp', 'win', 'real'], xlabel='review num',
>>>               ylabel='mu')
_images/fig_graphid_core_refresh_RefreshCriteria_ave_002.jpeg
property pos_frac
graphid.core.refresh.demo_refresh()[source]

CommandLine

python -m graphid.core.refresh demo_refresh \
        --num_pccs=40 --size=2 --show

Example

>>> # ENABLE_DOCTEST
>>> from graphid.core.refresh import *  # NOQA
>>> demo_refresh()
>>> util.show_if_requested()
graphid.core.refresh._dev_iters_until_threshold()[source]

INTERACTIVE DEVELOPMENT FUNCTION

How many iterations of ewma until you hit the Poisson / binomial threshold

This establishes a principled way to choose the threshold for the refresh criterion in my thesis. There are parameters (moving parts) that we need to work with: a, the patience; s, the span; and mu, our ewma.

s is a span parameter indicating how far we look back.

mu is the average number of label-changing reviews in roughly the last s manual decisions.

These numbers are used to estimate the probability that any of the next a manual decisions will be label-changing. When that probability falls below a threshold, we terminate. The goal is to choose a, s, and the threshold t such that the probability falls below the threshold after a maximum of a consecutive non-label-changing reviews. That is, we want to tie the patience parameter (how far we look ahead) to how far we are actually willing to go.
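The same question can be explored numerically. The sketch below counts how many consecutive non-label-changing reviews it takes for the probability to drop below t, under two assumptions that are not confirmed by the text above: each such review shrinks mu by a factor of (1 - alpha) with alpha = 2 / (s + 1), and the binomial form 1 - (1 - mu) ** a is used. The function name iters_until_threshold and the starting value mu0 are hypothetical.

```python
def iters_until_threshold(mu0, s, a, t, max_iters=10000):
    """Count non-label-changing reviews needed before the criterion triggers.

    mu0: starting ewma value, s: span, a: patience, t: threshold.
    Returns None if the threshold is not reached within max_iters reviews.
    """
    # Assumption: span-based smoothing factor for the ewma.
    alpha = 2.0 / (s + 1.0)
    mu = mu0
    for n in range(max_iters + 1):
        # Binomial probability that any of the next `a` reviews is label-changing.
        prob_any = 1.0 - (1.0 - mu) ** a
        if prob_any < t:
            return n
        # A non-label-changing review contributes 0 to the moving average.
        mu = (1.0 - alpha) * mu
    return None


# Compare the look-ahead (a) with the number of quiet reviews actually needed.
for patience in (20, 72, 200):
    n = iters_until_threshold(mu0=1.0, s=20, a=patience, t=0.1)
    print('patience', patience, 'iterations', n)
```

Comparing the printed iteration counts with each patience value shows how far the two can drift apart for a fixed span and threshold, which is the mismatch the tuning is meant to remove.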