Image of GoogleSQL: LLM-Friendly SQL Enabled by Pipe Syntax
( SQL )

GoogleSQL: LLM-Friendly SQL Enabled by Pipe Syntax

Text by Ryota Sasazawa

Published

GoogleSQL's pipe syntax is a new syntax that allows you to write SQL in a more intuitive and human-readable format, and it is also friendly to LLMs. This article explains how the pipe syntax can be used to improve NL2SQL, and how it can be used to improve NL2SQL.
Table of Contents

Why Does NL2SQL Even Matter

Over the past few years, Natural Language Processing (NLP) has made huge leaps forward. In particular, the rise of Large Language Models (LLMs) has had a transformative effect on a variety of fields. One such field is NL2SQL (Natural Language to SQL), which translates plain language queries into SQL. This approach allows people who aren’t database experts to interact with data using everyday language, potentially opening the door for more widespread data usage.

However, the traditional NL2SQL approach has run into several obstacles. Issues like SQL’s complexity, layered database schema structures, and the inherent ambiguity of natural language all pose significant hurdles to real-world implementation. Generating SQL with multiple JOINs and subqueries is especially difficult for LLMs, leading to a drop in accuracy.

Against this backdrop, GoogleSQL’s pipe syntax offers a new perspective for NL2SQL. Introduced by Google, the pipe syntax provides a more human-friendly way of writing SQL and a structure that’s easier for LLMs to interpret. In this article, I’ll discuss how pipe syntax—positioned as an “LLM-friendly SQL”—could help address NL2SQL’s challenges, and I’ll walk through the mechanics and possibilities it brings to the table.

Exploring GoogleSQL’s Pipe Syntax

1. Challenges of Conventional SQL and Difficulties for LLMs

SQL is declarative, and experienced engineers have usually found it relatively straightforward. That said, LLMs still struggle with how SQL is structured. Below are some common issues that can trip LLMs up when generating SQL:

  • Complexity of Nested Structures: When subqueries and JOINs pile up, SQL queries start nesting, making them increasingly complicated. LLMs often have trouble picking up on these multilayered relationships and generating correct SQL.
  • Clause Order Constraints: SQL clauses—SELECT, FROM, WHERE, GROUP BY, and so on—have a fixed order. This strict arrangement can clash with the natural word order in human language, which makes it tough for LLMs to produce valid SQL statements.
  • Database Schema Knowledge: For LLMs to generate SQL, they need a complete understanding of table structure, column names, and other schema details. Handling large amounts of external schema information is another challenge that can lead to mistakes.

These points are clearly highlighted in recent studies. A paper called “NL2SQL is a solved problem... Not!” shows that NL2SQL solutions relying on LLMs suffer a noticeable dip in accuracy with complex queries and larger schemas. Specifically, the paper demonstrates the difficulty in producing SQL with multiple JOINs, various aggregations and groupings, and subqueries.

2. Overview of GoogleSQL Pipe Syntax

GoogleSQL’s pipe syntax is a new approach designed to tackle the common pain points of conventional SQL. Its key feature is the way it lays out each processing step to match the flow of data. In standard SQL, the processing order can feel reversed: for example, the final output (SELECT) is listed before you even see where the data is coming from (FROM), and filters (WHERE) may appear quite far from the data source.

The pipe syntax, in contrast, uses the |> operator—similar to the concept of UNIX pipes—to pass data from one step to the next. This ensures the statement’s structure follows the actual data flow and makes everything a bit easier to read.

Here’s an example comparing standard SQL with pipe syntax:

-- Standard SQL
SELECT c_count, COUNT(*) AS custdist
FROM(
    SELECT c_custkey, COUNT(o_orderkey) c_count
    FROM customer
    LEFT OUTER JOIN orders ON c_custkey = o_custkey
    AND o_comment NOT LIKE '%unusual%packages%'
    GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
-- Pipe Syntax
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
   AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;

With pipe syntax, it’s more obvious how the data flows: it starts in the customer table, then moves through a LEFT OUTER JOIN with orders, then through two separate AGGREGATE steps, and finally an ORDER BY. Each step is connected with |>, creating a more natural, linear progression.

3. Why Pipe Syntax is LLM-Friendly

Below are three major reasons why LLMs handle pipe syntax more easily:

  1. Linear Structure: By chaining data operations together, the SQL becomes more linear overall. LLMs excel with sequential data, so the straightforward “pipeline” approach is simpler for them to parse and produce.

  2. Flexibility in Clause Order: Pipe syntax is less rigid about clause ordering than traditional SQL. For example, you can apply filters at various points in the pipeline rather than sticking to a strict WHERE clause position. This flexibility makes it possible for an LLM to generate queries that mirror natural language more closely.

  3. Step-by-Step Application of Schema Information: In pipe syntax, each pipeline stage only needs relevant schema info for that step. During a JOIN, for instance, only the columns for the joined tables are required. This method keeps the LLM from being overwhelmed by the entire schema at once and makes SQL generation more efficient.

Together, these traits address the main roadblocks mentioned in “NL2SQL is a solved problem... Not!”: nested structures, clause ordering constraints, and the complexity of fully integrating database schema knowledge.

4. Examples of NL2SQL Improvement Using Pipe Syntax

Let’s walk through some concrete examples that illustrate how pipe syntax can boost NL2SQL accuracy.

Example 1: Handling JOINs

Natural language query: “Show customers and their order counts, but only for orders where the comment doesn’t contain ‘unusual packages.’”

In standard SQL, you’d have a mix of JOINs and WHERE clauses that could be confusing. An LLM might jumble the join conditions or the filter. Using pipe syntax:

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
   AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) AS order_count
|> SELECT c_custkey, order_count;

Here, the flow is crystal clear: you start from customer, join with orders based on matching keys and the ‘comment’ filter, then aggregate. By keeping the join conditions and filters together, you reduce the chance of errors.

Example 2: Aggregation and Grouping

Natural language query: “Group customers by the number of orders they have, and sort these groups in descending order.”

The typical SQL might need subqueries, which can get bulky. With pipe syntax:

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
|> AGGREGATE COUNT(o_orderkey) AS order_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS customer_count
GROUP BY order_count
|> ORDER BY customer_count DESC;

The pipeline spells out every step: fetch data, join, count orders, group by c_custkey, then count how many customers fall into each group, and finally sort. Since each step is laid out in sequence, it’s less likely for the LLM to mix things up.

5. Potential for Further Improvement and Future Outlook

Despite the clear advantages, the pipe syntax still poses certain challenges:

  • Complex Conditional Branching: Pipe syntax fundamentally pushes data in one direction. Generating SQL that branches in more complicated ways might require workarounds.
  • Improving LLM Understanding: LLMs need a deeper understanding of what each pipeline stage does, so they can account for context and produce even more accurate SQL.

Making SQL More Widely Accessible

In this piece, I’ve looked at GoogleSQL’s pipe syntax as a way to tackle some of the core challenges in NL2SQL, including tricky nested structures, clause ordering, and schema handling. Thanks to its linear layout, flexible clause order, and the incremental use of schema details, pipe syntax can reduce the common pitfalls that LLMs encounter when trying to parse or generate SQL. Ideally, that means better NL2SQL performance and a broader audience for database interactions through plain language.

As we keep pushing the boundaries of making natural language and databases work seamlessly together, Liam ERD will continue exploring ways to encourage collaborative interactions between users and LLMs, especially around understanding and building out database schemas. If you’ve got ideas or suggestions, head over to our Discussions and let’s talk.

References

Text byRyota Sasazawa

Ryota Sasazawa is a software engineer at ROUTE06, Inc., working on Liam. Specializing in full-stack development, cloud infrastructure, and SRE, he builds scalable systems and improves site reliability.

Last edited on