SQL风格指南

您好!我是Matt Mazur,是一名数据分析师,曾在多家初创公司工作,帮助他们利用数据发展业务。本指南试图记录我对格式化SQL的偏好,以期对其他人有所帮助。如果您或您的团队还没有SQL样式指南,那么这可能是一个很好的起点,您可以根据自己的喜好采用和更新。

另外,我是一个意见偏激者,所以如果您不同意任何一点,给我一个便条,我很乐意去探讨。

如果您对这个主题感兴趣,您可能同样会对我的Matt on Analytics新闻稿件和我写的关于分析学和数据分析的博客感兴趣。

示例

下面是一个非常重要的查询,让您能够了解这个风格指南在实践中是什么样子的:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
with hubspot_interest as (

select
email,
timestamp_millis(property_beacon_interest) as expressed_interest_at
from hubspot.contact
where property_beacon_interest is not null

),

support_interest as (

select
email,
created_at as expressed_interest_at
from helpscout.conversation
inner join helpscout.conversation_tag on conversation.id = conversation_tag.conversation_id
where tag = 'beacon-interest'

),

combined_interest as (

select * from hubspot_interest
union all
select * from support_interest

),

final as (

select
email,
min(expressed_interest_at) as expressed_interest_at
from combined_interest
group by email

)

select * from final

准则

使用小写SQL

它和大写的SQL一样可读,您不必一直按住SHIFT键。

1
2
3
4
5
6
7
8
-- Good
select * from users

-- Bad
SELECT * FROM users

-- Bad
Select * From users

单行查询与多行查询

当你进行以下查询时才可以将所有SQL放到同一行:

  • 所有列(*)或选择1或2列
  • 而且在您的查询中没有额外的复杂性
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    -- Good
    select * from users

    -- Good
    select id from users

    -- Good
    select id, email from users

    -- Good
    select count(*) from users

原因很简单,当所有东西都在一条线上时,仍然很容易阅读。但是,一旦您开始添加更多的列或更复杂的列,如果它位于多行上,则更容易阅读:

1
2
3
4
5
6
7
8
9
10
11
-- Good
select
id,
email,
created_at
from users

-- Good
select *
from users
where email = 'example@domain.com'

对于具有1列或2列的查询,可以将列放在同一行上。对于3+列,将每个列名称放在自己的行上,包括第一项:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
-- Good
select id, email
from users
where email like '%@gmail.com'

-- Good
select user_id, count(*) as total_charges
from charges
group by user_id

-- Good
select
id,
email,
created_at
from users

-- Bad
select id, email, created_at
from users

-- Bad
select id,
email
from users

所有内容左对齐

1
2
3
4
5
6
7
8
9
-- Good
select id, email
from users
where email like '%@gmail.com'

-- Bad
select id, email
from users
where email like '%@gmail.com'

使用单引号

一些SQL方言(如BigQuery)支持使用双引号,但对于大多数方言来说,双引号最终会引用列名。因此,最好使用单引号:

1
2
3
4
5
6
7
8
9
-- Good
select *
from users
where email = 'example@domain.com'

-- Bad
select *
from users
where email = "example@domain.com"

使用!=而不是<>

只是因为!=读起来像“不等于”,这更接近于我们怎么大声说出来。

1
2
3
4
-- Good
select count(*) as paying_users_count
from users
where plan_name != 'free'

逗号应在行尾

1
2
3
4
5
6
7
8
9
10
11
-- Good
select
id,
email
from users

-- Bad
select
id
, email
from users

缩进Where条件

当只有一个Where条件时,请将其与where保持在同一行:

1
2
3
select email
from users
where id = 1234

如果存在多个条件,则将每个级别缩进到比where更深的一个级别。将逻辑运算符放在前一个条件的末尾:

1
2
3
4
5
select id, email
from users
where
created_at >= '2019-03-01' and
vertical = 'work'

避免括号内有空格

1
2
3
4
5
6
7
8
9
-- Good
select *
from users
where id in (1, 2)

-- Bad
select *
from users
where id in ( 1, 2 )

in查询值的长列表分隔为多个缩进行

1
2
3
4
5
6
7
8
9
-- Good
select *
from users
where email in (
'user-1@example.com',
'user-2@example.com',
'user-3@example.com',
'user-4@example.com'
)

表名应该是蛇形连接复数形式

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Good
select
id,
email,
timestamp_trunc(created_at, month) as signup_month
from users

-- Bad
select
id,
email,
timestamp_trunc(created_at, month) as SignupMonth
from users

列名约定

  • 布尔型字段应以is_has_does_开头,如:is_customer, has_unsubscribed等。
  • 日期型字段应以_date结尾,如:report_date。
  • 日期时间型字段应以_at结尾,如:created_at, posted_at等。

列顺序约定

首先放置主键,接着是外键,然后放置所有其他列。如果表中有任何系统列(created_at, updated_at, is_deleted等),则放到最后。

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Good
select
id,
name,
created_at
from users

-- Bad
select
created_at,
name,
id,
from users

内连接中包含inner

最好是显式的,以便连接类型清晰明了:

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Good
select
email,
sum(amount) as total_revenue
from users
inner join charges on users.id = charges.user_id

-- Bad
select
email,
sum(amount) as total_revenue
from users
join charges on users.id = charges.user_id

对于联接条件,将首先引用的表放在紧邻on之后

这样做可以更容易地确定连接是否会导致结果分散:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Good
select
...
from users
left join charges on users.id = charges.user_id
-- primary_key = foreign_key --> one-to-many --> fanout

select
...
from charges
left join users on charges.user_id = users.id
-- foreign_key = primary_key --> many-to-one --> no fanout

-- Bad
select
...
from users
left join charges on charges.user_id = users.id

单个联接条件应与join在同一行上

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Good
select
email,
sum(amount) as total_revenue
from users
inner join charges on users.id = charges.user_id
group by email

-- Bad
select
email,
sum(amount) as total_revenue
from users
inner join charges
on users.id = charges.user_id
group by email

当有多个连接条件时,将每个条件放置在各自的缩进行上:

1
2
3
4
5
6
7
8
9
-- Good
select
email,
sum(amount) as total_revenue
from users
inner join charges on
users.id = charges.user_id and
refunded = false
group by email

避免表别名

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Good
select
email,
sum(amount) as total_revenue
from users
inner join charges on users.id = charges.user_id

-- Bad
select
email,
sum(amount) as total_revenue
from users u
inner join charges c on u.id = c.user_id

唯一的例外是,当您需要多次联接到一个表上并需要区分它们时。

如非必须不要包含表名

1
2
3
4
5
6
7
8
9
10
11
-- Good
select
id,
name
from companies

-- Bad
select
companies.id,
companies.name
from companies

总是重命名聚合和函数包装的参数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- Good
select count(*) as total_users
from users

-- Bad
select count(*)
from users

-- Good
select timestamp_millis(property_beacon_interest) as expressed_interest_at
from hubspot.contact
where property_beacon_interest is not null

-- Bad
select timestamp_millis(property_beacon_interest)
from hubspot.contact
where property_beacon_interest is not null

确保布尔条件是显示的

1
2
3
4
5
6
7
-- Good
select * from customers where is_cancelled = true
select * from customers where is_cancelled = false

-- Bad
select * from customers where is_cancelled
select * from customers where not is_cancelled

使用as定义列别名

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Good
select
id,
email,
timestamp_trunc(created_at, month) as signup_month
from users

-- Bad
select
id,
email,
timestamp_trunc(created_at, month) signup_month
from users

用列名而不是序号做分组

1
2
3
4
5
6
7
8
9
10
11
-- Good
select user_id, count(*) as total_charges
from charges
group by user_id

-- Bad
select
user_id,
count(*) as total_charges
from charges
group by 1

使用列别名

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Good
select
timestamp_trunc(com_created_at, year) as signup_year,
count(*) as total_companies
from companies
group by signup_year

-- Bad
select
timestamp_trunc(com_created_at, year) as signup_year,
count(*) as total_companies
from companies
group by timestamp_trunc(com_created_at, year)

分组列应置于最前

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Good
select
timestamp_trunc(com_created_at, year) as signup_year,
count(*) as total_companies
from companies
group by signup_year

-- Bad
select
count(*) as total_companies,
timestamp_trunc(com_created_at, year) as signup_year
from mysql_helpscout.helpscout_companies
group by signup_year

对齐case/when语句

每个when都应该在自己的行上(case所在行没有任何内容),并且应该缩进比case行深一层。then可以和when在同一行上或者在下一行,这样做只是为了保持一致。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- Good
select
case
when event_name = 'viewed_homepage' then 'Homepage'
when event_name = 'viewed_editor' then 'Editor'
end as page_name
from events

-- Good too
select
case
when event_name = 'viewed_homepage'
then 'Homepage'
when event_name = 'viewed_editor'
then 'Editor'
end as page_name
from events

-- Bad
select
case when event_name = 'viewed_homepage' then 'Homepage'
when event_name = 'viewed_editor' then 'Editor'
end as page_name
from events

使用通CTEs而非子查询

避免子查询,CTE(通用表表达式)将使您的查询更容易阅读和解释。
使用CTE时,请用新行填充查询。
如果使用任何CTE,请始终在末尾使用名为final的CTE,并选择* from final。这样,您就可以快速检查查询中用于调试结果的其他CTE的输出。
关闭CTE括号应使用与WITH和CTE名称相同的缩进级别。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
-- Good
with ordered_details as (

select
user_id,
name,
row_number() over (partition by user_id order by date_updated desc) as details_rank
from billingdaddy.billing_stored_details

),

final as (

select user_id, name
from ordered_details
where details_rank = 1

)

select * from final

-- Bad
select user_id, name
from (
select
user_id,
name,
row_number() over (partition by user_id order by date_updated desc) as details_rank
from billingdaddy.billing_stored_details
) ranked
where details_rank = 1

使用有意义的CTE名称

1
2
3
4
5
-- Good
with ordered_details as (

-- Bad
with d1 as (

窗口函数

您可以将其全部保留在自己的行中,也可以根据其长度将其拆分为多个:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Good
select
user_id,
name,
row_number() over (partition by user_id order by date_updated desc) as details_rank
from billingdaddy.billing_stored_details

-- Good
select
user_id,
name,
row_number() over (
partition by user_id
order by date_updated desc
) as details_rank
from billingdaddy.billing_stored_details

感谢

本风格指南灵感来源于:

向Peter Butler、Dan Wyman、Simon Ouderkirk、Alex Cano、Adam Stone、Brian Kim和Claire Carroll提供关于本指南的反馈意见表示感谢。


0%