Pseudo-Data Injections for CLO Bandit Problems

Published:

Contextual linear optimization (CLO) with bandit feedback is a class of CLO problems in which only the costs of historically taken actions are observed. Learning an optimal decision-making policy in this setting faces a fundamental challenge: real-world data often lacks coverage over the action space, so the full cost vector cannot be identified from the available data. A common remedy is to apply regularization to ensure the stability of the learning problem. We show that this approach admits an alternative interpretation as a specific form of pseudo-data injection, in which synthetic data is added to induce coverage. This perspective raises a broader question: how can arbitrary pseudo-data be injected when prior beliefs about the environment or the data collection process are available? We propose two methods of pseudo-data injection that reflect structured beliefs about the underlying cost distribution or the data collection process, and show that regularization is a special case.
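
As a rough illustration of the regularization-as-pseudo-data idea, the sketch below uses the classical ridge regression identity: an L2 penalty gives the same estimate as ordinary least squares on a dataset augmented with synthetic rows that have zero targets. This is a textbook analogue in a plain linear regression setting, not the paper's CLO construction; the function names, the toy data, and the penalty weight `lam` are all assumptions made for the example.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate from the regularized normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_via_pseudo_data(X, y, lam):
    """Same estimate, obtained by injecting d synthetic observations.

    Each pseudo-row is sqrt(lam) times a unit vector with target 0,
    so every coordinate direction receives some coverage.
    """
    d = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    theta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return theta

# Toy data with poor coverage (hypothetical): only the first two of
# four feature directions are ever observed, so unregularized least
# squares cannot identify the last two coefficients.
rng = np.random.default_rng(0)
X = np.zeros((50, 4))
X[:, :2] = rng.normal(size=(50, 2))
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=50)

theta_ridge = ridge_closed_form(X, y, lam=1.0)
theta_pseudo = ridge_via_pseudo_data(X, y, lam=1.0)
```

In this toy setup the two estimates coincide, and the coefficients in the unobserved directions are pulled to zero because the pseudo-data is the only evidence about them. Replacing the zero targets or the identity rows with other synthetic observations is one way to picture how different prior beliefs might be encoded.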

Download Paper