SELECT Queries
Represents a single SELECT
query. Functionality is the same as Trino SQL. Any caveats are listed explicitly
SELECT
SELECT *
is not supported right now.- Aliases are mandatory in case of conflicts, and qualified references to columns.
- They are also mandatory for function calls.
SELECT SUM(a) AS b
is a correct query, whileSELECT SUM(a)
is not. - If referring to a column from a joined table, the column will include the qualifier too based on the joined table name or alias. It is recommended to use an appropriate alias in this case as well. For example,
SELECT a.name as name
orSELECT table_1.user_id as user_id
- They are also mandatory for function calls.
DISTINCT
andCOUNT(*)
will work as expected- While processing the distinct values of a column, all the NULLs are considered as one distinct value
FROM
- Please use the table name.
- Subqueries will work here. Aliases are mandatory for subqueries
WHERE
- Conditions, including nested brackets, operator overloading and functions should work as expected
- Subqueries should work here and aliases are NOT mandatory
GROUP BY
- While grouping on a column, all the NULLs are put together in one bucket
- GROUP BY with ordinal references to SELECT clauses are supported.
ORDER BY
- It is mandatory to specify LIMIT or OFFSET Clause if ORDER BY is used inside a subquery or CTE
LIMIT
The LIMIT
clause restricts the number of rows in the result set.
HAVING
- Having works when a group by has been configured
JOIN
- All join clauses are supported, including the different types of joins
- Subqueries will work here but aliases are mandatory
- When an alias is specified for a table name, then:
- If the column name is unique, the column can either directly be referenced or the alias must be used as the qualifier, not the table name.
- For example, the following query is accurate: table_1 contains columns a, b, c and table_2 contains columns a, b, d
select t1.a as a, t2.b as b, t1.c as c, d
from
table_1 t1
inner join
table_2 t2
on t1.a = t2.a - For example, the following query is NOT accurate: table*1 contains columns a, b, c and table_2 contains columns a, b, d
select t1.a as a, t2.b as b, table_1.c as c, d - - \_table_1.c is not valid*
from
table_1 t1
inner join
table_2 t2
on t1.a = t2.a
- For example, the following query is accurate: table_1 contains columns a, b, c and table_2 contains columns a, b, d
- If a column is unique, it can be referenced directly or using the table name qualifier (for example, column d is unique but it is referenced as either d or t1.d). In the case the table name alias is used as a qualifier, an alias for the column is mandatory
- If a column name is not unique, the table alias must be used as a qualifier, not the table name. In this case, an alias for the column is mandatory
- If the column name is unique, the column can either directly be referenced or the alias must be used as the qualifier, not the table name.
Null Handlings
Similar to <=>
operator in spark, Trino has IS DISTINCT FROM
and IS NOT DISTINCT FROM
operators. In SQL a NULL
value signifies an unknown value, so any comparison involving a NULL
will produce NULL
. These two operators treat NULL
as a known value and both guarantee either a true or false outcome even in the presence of NULL
input
SELECT * FROM memory.test_schema.dummy_table; -- Table 1
->
id | name | created_at
----+---------+-------------------------
1 | A | 2024-11-01 10:00:00.000
2 | NULL | 2024-11-02 11:30:00.000
3 | C | NULL
4 | NULL | NULL
SELECT * FROM memory.test_schema.dummy_table_2; -- Table 2
->
id | name | created_at | status | score
----+---------+-------------------------+----------+-------
1 | A | 2024-11-01 10:00:00.000 | active | 85.5
2 | NULL | 2024-11-02 11:30:00.000 | inactive | NULL
3 | C | NULL | NULL | 70.0
4 | NULL | NULL | active | NULL
5 | E | 2024-11-04 09:00:00.000 | NULL | 90.0
IS NOT DISTINCT FROM
VS =
Consider the following query which uses IS NOT DISTINCT FROM
in the ON clause. There are null values in both name and created_at columns of both the tables.
SELECT a.id as a_id, b.id as b_id, a.name as a_name, b.name as b_name,
a.created_at as a_created_at, b.created_at as b_created_at, status, score
FROM memory.test_schema.dummy_table a
LEFT JOIN memory.test_schema.dummy_table_2 b
ON a.name IS NOT DISTINCT FROM b.name
AND a.created_at IS NOT DISTINCT FROM b.created_at;
->
a_id | b_id | a_name | b_name | a_created_at | b_created_at | status | score
------+------+---------+---------+-------------------------+-------------------------+----------+-------
1 | 1 | A | A | 2024-11-01 10:00:00.000 | 2024-11-01 10:00:00.000 | active | 85.5
2 | 2 | NULL | NULL | 2024-11-02 11:30:00.000 | 2024-11-02 11:30:00.000 | inactive | NULL
3 | 3 | C | C | NULL | NULL | NULL | 70.0
4 | 4 | NULL | NULL | NULL | NULL | active | NULL
This behaviour is compatible with spark’s <=>
operator, which allows for NULL Safe comparisons. IS NOT DISTINCT FROM
equates the NULLs from the left and the right tables while joining as opposed to =
operator which when sees NULL in the joining column of the left table produces NULL as output in the columns of the right table.
SELECT a.id as a_id, b.id as b_id, a.name as a_name, b.name as b_name,
a.created_at as a_created_at, b.created_at as b_created_at, status, score
FROM memory.test_schema.dummy_table a
LEFT JOIN memory.test_schema.dummy_table_2 b
ON a.name = b.name AND a.created_at = b.created_at;
->
a_id | b_id | a_name | b_name | a_created_at | b_created_at | status | score
------+------+---------+--------+-------------------------+-------------------------+--------+-------
1 | 1 | A | A | 2024-11-01 10:00:00.000 | 2024-11-01 10:00:00.000 | active | 85.5
2 | NULL | NULL | NULL | 2024-11-02 11:30:00.000 | NULL | NULL | NULL
3 | NULL | C | NULL | NULL | NULL | NULL | NULL
4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL
UNION
- Two
NULL
values are considered equal
WINDOW
- Window functions may have performance issues. They should work, but please ensure that you are testing the performance before committing to a window based approach.
OFFSET
The OFFSET
clause is used to skip a specified number of rows before beginning to return rows from the query.
EXCEPT
- Two
NULL
values are considered equal
INTERSECT
- Two
NULL
values are considered equal
WITH
- CTEs are supported.
- Recursive CTEs are not supported.
- WITH FUNCTION will not work
UNNEST
Only supports one column that can be exploded at a time
CASE
IN / NOT IN
IN
expression can return a TRUE
, FALSE
or NULL
value.
Null Handlings
The NULL semantics in IN
can be deduced from the NULL
value handling in comparison operators=
and logical operators OR
. If the subquery in where clause only returns NULL value, no rows will be selected by the IN
operator even if NULL
value is present in the column in question.
SELECT * FROM memory.test_schema.dummy_table;
id | name | created_at | status | score |
---|---|---|---|---|
1 | A | 2024-11-01 10:00:00 | active | 85.5 |
2 | NULL | 2024-11-02 11:30:00 | inactive | NULL |
3 | C | NULL | NULL | 70.0 |
4 | NULL | NULL | active | NULL |
5 | E | 2024-11-04 09:00:00 | NULL | 90.0 |
Consider the above table.
- The subquery has only
NULL
value in its result set. Therefore, theIN
predicate produces no rows as the result even though name column hasNULL
values
SELECT * FROM memory.test_schema.dummy_table WHERE name in (SELECT NULL);
id | name | created_at |
---|---|---|
- The subquery has
NULL
value in the result set as well as a valid value‘A’
. Rows with name = ‘A’ are returned.
SELECT * FROM memory.test_schema.dummy_table
WHERE name in (SELECT name from (VALUES ('A'), (NULL)) AS sub(name));
id | name | created_at |
---|---|---|
1 | A | 2024-11-01 10:00:00 |
Compound Queries
Containing more than one SELECT
queries.
SubQueries
- SubQueries should work as expected
- Aliases for subqueries are mandatory