Introducing Daft-SQL

A SQL API enabling users to interact with their data in a new but familiar way!

Oct 23, 2024

What is Daft-SQL?

Our new Daft-SQL API is designed to let users interact with their data using traditional SQL queries within the daft ecosystem. Whether you’re analyzing large datasets or performing simple queries, Daft-SQL offers a streamlined, SQL-based interface that simplifies your workflow.

Our SQL dialect is based on PostgreSQL, one of the most widely used dialects in the industry. This means that if you’re already familiar with PostgreSQL, you'll feel right at home using Daft-SQL. Common SQL commands and functions, such as SELECT, WHERE, and JOIN, behave as you’d expect, ensuring a low learning curve for those accustomed to SQL.

Why Use Daft-SQL?

For many users, learning a new DataFrame API can be challenging—and being forced to do so can be frustrating, especially when you’re already proficient in SQL. Daft-SQL bridges this gap by allowing you to query your data with SQL commands, eliminating the need to learn new DataFrame-specific methods and syntax. This makes it particularly appealing for SQL professionals and data analysts who want to leverage the power of daft without having to dive into the intricacies of a new API.

Daft-SQL is built on the same query execution engine as the native daft DataFrame API, so there are no performance trade-offs between using SQL and using the DataFrame methods. You get the same speed and efficiency both ways. This allows you to choose the interface that works best for you, without worrying about sacrificing performance.

How to Use Daft-SQL

Using Daft-SQL is simple. It’s already integrated into the daft library, so no additional setup is required. Here’s a basic example to get you started:

import daft
df = daft.read_csv('data.csv')

# Run a SQL query
daft.sql('SELECT * FROM df').collect()

# Alternatively, you can manually construct a `SQLCatalog` for more control over the SQL context
from daft.sql import SQLCatalog

# Create a SQLCatalog
catalog = SQLCatalog({'table_1': df})

daft.sql('SELECT * FROM table_1', catalog).collect()

In this example, we import the necessary components from daft, create a DataFrame by reading a CSV file, and then run a simple SQL query to select all records from the DataFrame. Daft-SQL supports all the usual SQL operations, making it extremely flexible for a variety of use cases. Whether you need to filter data, aggregate results, or join tables, Daft-SQL has you covered!

Tips and Tricks

We’ve designed Daft-SQL to stay as close to our DataFrame API as possible. However, there are a few key differences in function names that users should be aware of.

In Daft-SQL, all .str namespaced functions from the DataFrame API are available globally, and we use snake_case for function names to maintain consistency. This ensures that functions are easy to understand and use, while also aligning with SQL conventions.

Some notable differences include:

str.endswith is written as ends_with in Daft-SQL.
str.startswith is written as starts_with in Daft-SQL.

Aside from these small variations, most functions have a direct mapping between the DataFrame API and the SQL API, so switching between the two is straightforward.

Here’s an example to illustrate this:

# DataFrame API
df.select(
  col('name').str.endswith('x'),
  col('lst').list.count(),
  col('img').image.decode()
)

# SQL API
daft.sql("""
  SELECT 
    ends_with(name, 'x'),
    list_count(lst),
    image_decode(img)
  FROM df
""")

In this example, you can see how similar the functionality is across both APIs. This makes it easy to switch between the two depending on your preferred style or the task at hand.

Furthermore, certain features, like aliasing columns and conditional logic, are fully supported using standard SQL syntax. For instance:

# DataFrame API
df.select(col('name').alias('my_alias'))
df.select((df["A"] > df["B"]).if_else(df["A"], df["B"]))

# SQL API
daft.sql('SELECT name AS my_alias FROM df')
daft.sql('SELECT CASE WHEN A > B THEN A ELSE B END FROM df')

With Daft-SQL, we’re excited to offer a tool that brings together the power of SQL and the flexibility of daft. Happy querying!

Need Help?

We’re here to support you as you explore the full potential of Daft-SQL → See Daft-SQL Documentation

If you run into any issues or have questions, feel free to reach out to us via our Distributed Data Slack Community or open an issue on our Daft GitHub. We’re always looking for feedback to make Daft-SQL even better, so don’t hesitate to share your thoughts and experiences!

A guest post by

Cory Grinstead

Daft Engineering Blog

Discussion about this post