by V. Raychev, P. Bielik, M. Vechev, A. Krause

Abstract:

We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator which produces a candidate program based on a small sample of the entire dataset while avoiding overfitting, and a dataset sampler which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach to two settings with noise: the setting where the dataset has a bound on the noise, and the setting where no bound on the noise is known. We then present new kinds of program synthesizers which target particular noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of in- correct input/output examples. We show that the synthesizer can detect errors in the examples while combating overfitting – a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as ``Big Code''. The key idea is to introduce a domain specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique which generalizes several existing works as we no longer need to decide a priori on what the prediction should be conditioned (another benefit is that the learned functions act as a natural mechanism for explaining the prediction to the programmer). As a result, our code completion system surpasses the prediction capabilities of state of the art, hard-wired systems.

Reference:

Learning Programs from Noisy Data V. Raychev, P. Bielik, M. Vechev, A. KrauseIn Proc. ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2016

Bibtex Entry:

@inproceedings{raychev16learning,
	author = {Veselin Raychev and Pavol Bielik and Martin Vechev and Andreas Krause},
	booktitle = {Proc. ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL)},
	month = {January},
	title = {Learning Programs from Noisy Data},
	year = {2016}}