CONVAID’s Evolution in Natural Language to SQL: Exploring Vanna, MAC-SQL, and PET-SQL Frameworks 

blogs Banners (3)

Introduction 

When I started working on CONVAID, a system designed to convert natural language into SQL queries, I realized that building an accurate and efficient pipeline required more than just using a large language model. The task involved understanding the intricacies of database schemas, improving query precision, and handling a variety of real-world queries. Over time, I explored multiple frameworks—Vanna, MAC-SQL, and finally PET-SQL—to refine the process and improve accuracy. This blog reflects my journey and the lessons I learned while enhancing CONVAID’s SQL generation capabilities. 

Starting with Vanna Framework 

My journey began with integrating the Vanna Framework, a Python package that leverages Retrieval-Augmented Generation (RAG) to generate SQL queries from natural language. The framework was promising, offering a way to incorporate user feedback and progressively improve query generation through reinforcement learning. 

To make Vanna work effectively for CONVAID, I had to fine-tune it using data from multiple sources: 

  • Data Definition Language (DDL): I structured and uploaded DDL files to define the database schema.
  • Table Documentation: This gave the model detailed insights into table content, helping it understand context better.
  • Question-SQL Pairs: These pairs served as training data, allowing the model to recognize query patterns. 

While Vanna provided a solid foundation, I quickly realized its limitations. It required significant manual review and validation, and its accuracy plateaued at around 66% despite numerous refinements. This pushed me to look for a more robust solution. 

Enhancing with MAC-SQL Framework 

To address Vanna’s shortcomings, I transitioned to the MAC-SQL Framework. MAC-SQL introduced specialized agents—Selector, Decomposer, Desemanticization, and Refiner—to break down and refine complex queries. This approach significantly boosted CONVAID’s ability to handle diverse and domain-specific queries. 

Here’s how I improved the pipeline using MAC-SQL: 

  • Selector Agent: Retrieved the most relevant question-SQL pairs, improving the model’s contextual understanding. 
  • Decomposer Agent: Broke down complex queries into manageable components, making them easier for the model to parse. 
  • Refiner Agent: Iteratively corrected SQL queries, ensuring syntactic accuracy and reducing errors. 

Integrating MAC-SQL resulted in a notable improvement, raising the accuracy to 68%. However, the additional LLM calls increased processing time and computational costs, which became a concern as the query volume grew. 

Reaching New Heights with PET-SQL Framework 

Determined to optimize both accuracy and efficiency, I turned to the PET-SQL Framework—a more sophisticated two-stage system that refined query conversion using advanced masking techniques and improved in-context example retrieval. PET-SQL was a game-changer for CONVAID. 

Here’s what made PET-SQL stand out: 

  • Desemanticization Layer: Extracted and masked unnecessary information to prevent irrelevant context from interfering with query generation. 
  • Enhanced Schema Linking: Improved entity alignment between natural language queries and SQL schema. 
  • In-Context Demonstrations: Provided high-quality examples to the model, allowing it to refine its understanding during inference. 

This refined pipeline pushed CONVAID’s accuracy to 86%, marking a significant milestone in my journey. The improvements in accuracy, coupled with reduced processing overhead, made PET-SQL the most effective solution for our SQL generation task. 

Testing and Validation 

Throughout this journey, I conducted extensive testing to validate each framework’s performance. I built a comprehensive test suite that evaluated the generated SQL queries on multiple dimensions, including syntax correctness, execution success, and logical accuracy. Comparing these results gave me a clear picture of how each framework contributed to improving CONVAID’s performance. 

Lessons Learned and Looking Forward 

Working through these frameworks taught me that enhancing a natural language to SQL system isn’t just about improving model architecture. It requires: 

  1. Domain Adaptation: Tailoring models to specific databases by leveraging domain knowledge.  
  2. Reinforcement through Feedback: Continuously refining query outputs with user feedback. 
  3. Cost-Accuracy Balance: Managing trade-offs between accuracy and computational cost in high-volume environments.  

    The journey from Vanna to PET-SQL has been challenging yet rewarding, and I’m eager to continue pushing the boundaries of what CONVAID can achieve. 

    Leave a Comment

    Your email address will not be published. Required fields are marked *