Economic Utilization of Workforce-Based Labeling for Security Applications

Tomer Geva; Maytal Saar-Tsechansky (U. Texas Austin)


Supervised learning is a key technology for handling security threats: it captures patterns in historical data that are characteristic of threats and then detects these patterns in new data. It has been successfully applied to many important security applications, including inappropriate content filtering, intrusion detection, video-surveillance-based intention detection, internet bullying detection, and online fraud detection. Recently, online marketplaces for human intelligence tasks, such as Amazon’s Mechanical Turk, have presented exciting opportunities for using human intelligence to enhance or complement data-driven learning algorithms. However, achieving these benefits is non-trivial. For this promise to materialize, it is imperative to characterize and address the many new challenges these marketplaces present.

This research aims to be the first to provide a comprehensive information acquisition policy for human labeling markets in support of security modeling tasks. As such, it aims to produce a novel label acquisition policy that maximizes modeling performance under given budget and time constraints. To accommodate human labeling markets, the policies we aim to develop will also consider the labeling task assignment, the capacity to acquire multiple labels for a single data instance, the pay offered per label, and effective incentives and labeler screening mechanisms. To that end, we first aim to develop a novel framework that accommodates the rich set of components of this problem. We also aim to implement the proposed solution in software and empirically evaluate its performance in real-world settings across different kinds of human intelligence marketplaces. Empirical evaluations will involve both simulation and live experiments, deploying online platforms for both non-expert and expert workers. The suggested framework includes the development of several modules to address the challenges above. Specifically, the modules include:

  1. Continuous evaluation of labeler/worker quality. This module will estimate the expected tradeoff functions for payment versus quality and for payment versus time.
  2. A data-driven learning module responsible for continuous re-training of supervised learning algorithms.
  3. A performance evaluation module that evaluates model performance over an independent test set.
  4. A “policy” for selecting informative training instances to be labeled and used for model learning, and for setting the pay offered per labeled instance. This module will also decide whether multiple labels are needed per instance, and whether to invoke screening questions to screen potential labelers. This module will interact with modules 1-3.
  5. A real-time front end web interface to interact with the online workers and allocate the relevant data instances for labeling.
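
To make the interplay between multiple labeling per instance (module 4) and labeler quality evaluation (module 1) concrete, the following is a minimal sketch, not the framework proposed here. It assumes a simple majority vote as the label aggregation rule and uses each worker's agreement with the majority as a rough quality proxy; the function names, the data layout, and the toy votes are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most common label; ties go to the smallest label value."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def aggregate(votes):
    """Map each instance id to its majority label.

    votes: iterable of (instance_id, worker_id, label) tuples (hypothetical layout).
    """
    by_instance = defaultdict(list)
    for instance_id, _worker_id, label in votes:
        by_instance[instance_id].append(label)
    return {i: majority_vote(ls) for i, ls in by_instance.items()}


def worker_agreement(votes, consensus):
    """Fraction of each worker's labels that match the majority label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for instance_id, worker_id, label in votes:
        totals[worker_id] += 1
        hits[worker_id] += int(label == consensus[instance_id])
    return {w: hits[w] / totals[w] for w in totals}


# Toy data: three workers each label two instances (1 = threat, 0 = benign).
votes = [
    ("x1", "w1", 1), ("x1", "w2", 1), ("x1", "w3", 0),
    ("x2", "w1", 0), ("x2", "w2", 0), ("x2", "w3", 1),
]
consensus = aggregate(votes)
quality = worker_agreement(votes, consensus)
print(consensus)
print(quality)
```

In this toy run, worker `w3` disagrees with the majority on both instances, so a policy could use such a signal to screen that worker or to request additional labels. A real policy would also need to weigh pay, time, and budget, which this sketch leaves out.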

We expect this research agenda to yield several research contributions. Specifically, the expected contributions include opening the data-acquisition bottleneck in machine learning security applications, and algorithmic research towards maximizing security performance under budget and time constraints.

Tel Aviv University, P.O. Box 39040, Tel Aviv 6997801, Israel