Original Paper - https://arxiv.org/pdf/1709.00103.pdf
Abstract (from the paper)
Relational databases store a significant amount of the world's data. However, accessing this data currently requires users to understand a query language such as SQL. We propose Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries. Our model uses rewards from in-the-loop query execution over the database to learn a policy to generate the query, which contains unordered parts that are less suitable for optimization via cross entropy loss. Moreover, Seq2SQL leverages the structure of SQL to prune the space of generated queries and significantly simplify the generation problem. In addition to the model, we release WikiSQL, a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia that is an order of magnitude larger than comparable datasets. By applying policy-based reinforcement learning with a query execution environment to WikiSQL, Seq2SQL outperforms a state-of-the-art semantic parser, improving execution accuracy from 35.9% to 59.4% and logical form accuracy from 23.4% to 48.3%.
Some related work
- Semantic parsing, e.g. question answering systems built over knowledge graphs.
- Natural language interfaces for databases. One prominent work in this area is PRECISE (Popescu et al., 2003).
- Representation learning for sequence generation. Dong & Lapata (2016)’s attentional sequence to sequence neural semantic parser, which we use as the baseline, achieves state-of-the-art results on a variety of semantic parsing datasets despite not utilising hand-engineered grammar.
An augmented pointer network architecture is used to generate the query from the input. The input to the model is the concatenation of the table's column names, the question, and the limited SQL vocabulary (SELECT, COUNT, etc.), and the output is the SQL query.
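To make the input and output concrete, here is a minimal sketch of how such an input sequence could be assembled and how a pointer-style output (a list of positions in the input) maps back to a query string. The function names, the sentinel tokens `<col>`, `<sql>` and `<question>`, the small SQL vocabulary, and the toy table are all assumptions for illustration, not the authors' code.

```python
def build_input_sequence(column_names, question,
                         sql_vocab=("SELECT", "COUNT", "WHERE", "AND", "=", ">", "<")):
    """Concatenate column names, SQL vocabulary and question into one token list.

    Sentinel tokens mark the boundaries between the three parts (assumed names).
    """
    tokens = []
    for name in column_names:
        tokens.append("<col>")
        tokens.extend(name.lower().split())
    tokens.append("<sql>")
    tokens.extend(sql_vocab)
    tokens.append("<question>")
    tokens.extend(question.lower().split())
    return tokens


def decode_pointers(tokens, pointers):
    """A pointer network emits positions in the input; copy those tokens out."""
    return " ".join(tokens[i] for i in pointers)


# Hypothetical example table with two columns.
tokens = build_input_sequence(["Player", "Country"],
                              "which country is tiger woods from")
query = decode_pointers(tokens, [5, 3, 7, 1, 9, 16, 17])
print(query)
```

Because every output token is copied from the input, the model can only produce column names, question words, and SQL keywords, which is what keeps the output space small.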
Using reinforcement learning, they achieved 59.4% execution accuracy on the test set. This was better than the other model variants they benchmarked.
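The abstract reports two metrics: execution accuracy (the predicted query returns the correct result) and logical form accuracy (the predicted query matches the ground-truth query exactly). The sketch below shows how the two can differ; the dictionary keys and the example queries are hypothetical.

```python
def accuracies(examples):
    """Return (execution accuracy, logical form accuracy) over a list of examples."""
    n = len(examples)
    n_ex = sum(1 for e in examples if e["pred_result"] == e["gold_result"])
    n_lf = sum(1 for e in examples if e["pred_query"] == e["gold_query"])
    return n_ex / n, n_lf / n


examples = [
    {   # WHERE conditions in a different order: same result, different string
        "pred_query": "SELECT a WHERE b = 1 AND c = 2",
        "gold_query": "SELECT a WHERE c = 2 AND b = 1",
        "pred_result": [("x",)],
        "gold_result": [("x",)],
    },
    {   # exact match
        "pred_query": "SELECT a WHERE b = 1",
        "gold_query": "SELECT a WHERE b = 1",
        "pred_result": [("y",)],
        "gold_result": [("y",)],
    },
]
acc_ex, acc_lf = accuracies(examples)
print(acc_ex, acc_lf)
```

The first example is the "unordered parts" case from the abstract: swapping WHERE conditions changes the string but not the result, so exact-match metrics and cross entropy loss penalise a query that is actually correct.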
All the models were built using PyTorch. For training the WHERE clause, they first used "teacher forcing" (i.e. the policy is not learned from scratch) and then continued training with reinforcement learning.
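The reinforcement learning signal comes from executing the generated query against the database. A minimal sketch of such a reward, using an in-memory SQLite table, is shown below. The reward values (-2 for an invalid query, -1 for a valid query with the wrong result, +1 for the correct result) follow the scheme described in the paper; the function name, the `golf` table, and the use of `sqlite3` are assumptions for illustration only.

```python
import sqlite3


def execution_reward(conn, predicted_sql, gold_result):
    """Score a generated query by running it against the database."""
    try:
        result = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return -2  # query is not valid SQL for this table
    # Order-sensitive comparison; fine for this single-row toy example.
    return 1 if result == gold_result else -1


# Hypothetical toy table standing in for a WikiSQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE golf (player TEXT, country TEXT)")
conn.execute("INSERT INTO golf VALUES ('tiger woods', 'united states')")
gold = conn.execute(
    "SELECT country FROM golf WHERE player = 'tiger woods'").fetchall()

r_correct = execution_reward(
    conn, "SELECT country FROM golf WHERE player = 'tiger woods'", gold)
r_wrong = execution_reward(
    conn, "SELECT country FROM golf WHERE player = 'nobody'", gold)
r_invalid = execution_reward(conn, "SELECT FROM WHERE", gold)
print(r_correct, r_wrong, r_invalid)
```

In a policy-gradient setup this scalar would weight the log-probability of the sampled WHERE clause, which is why a reasonable starting policy from teacher forcing helps: a policy trained from scratch would almost always receive the -2 reward.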
Some hand-annotated question-and-answer datasets and related references:
- WikiSQL dataset (github link)
- Towards a Theory of Natural Language Interfaces to Databases (PDF link), Popescu, Etzioni, Kautz, University of Washington.