SQL

Next: Indexes

The world is full of success stories about standardisation and the benefits of us doing things in the same way, following the same approach. SQL coding is no different. If you have a team of peropl working together but not following any kind of standard or best practices, things get messy and difficult very fast.

Everything from DB table names to code documentation , it all matters. Agreeing a naming convention and style inside your organisation will already make your life so much easier. You will be amazed how much easier it is to collaborate, read code , debug, etc.

Here is a simple list of my own personal best practices.

1). Never use SELECT*

It is never good to select all columns from a table. It is rare that you will actually want all values. You should define the subset of columns you want to have returned. This is especially true for larger tables. Too much unnecessary load on the database engine.

The select can be refined in a number of way. If, for example, your query contains a JOIN and you only need values from a single table, be specific about that. Instead of SELECT * use SELECT table_name.*.

Ideally you only select the columns you need

SELECT table_1.column_name, table_2.column_name

2). Seperate attribute columns to rows

Same as above but give each column its own row

SELECT
table_1.column_name
, table_2.column_name

In addition to this, consider renaming any columns with a name which better reflects its purpose in this context. For eaxmple, a column named date doesn’t tell you very much so maybe it is better to rename it in the query to indicate what date we are talking about – like created_date.

Or you could add a prefix to the column name. Quite often it is enough if the prefix is the table name where the column is. For example, for a column called sign_date from the table called contract, renaming the column in the query to contract_sign_date would be very useful.


3). Set a naming convention and don’t allow any exceptions EVER!

Don’t fool yourself on this one, any exceptions allowed will eventually come back to bite you. The set naming convention should be used across the board.

If, for example, a specific value in your table tables represents a monetary amount, you need to agree how this will be named. Will it be amount or amt, for example. Doesn’t matter what is agreed. It only matters that the same convention is used by everyone going forward. If the team agrees to use amt and one person uses amount, inconsistancy is introduced

It is better to focus on other parts of the SQL design/solution – better to just know “this is the convention – it is always amt“.

Consistency is key

  1. tables & columns (lower-case, snake-form)
  2. keys (suffix _id )
  3. date columns suffix _date
  4. datetime columns suffix _datetime or _dt
  5. indexes (prefix: idx_ , ak_ )
  6. table or column names always always always in singular
  7. flag indication (1=yes, 0=no, -1=unknown) always suffix _flag (TINYINT NOT NULL DEFAULT -1 )
  8. separate integer ID’s and varchar codes
    • suffix _id for INTEGER (11)
    • suffix _code for VARCHAR (35)

4). Be descriptive, don’t use acronyms

Using acronyms almost always cause issues down the line. It leads to mistakes while coding if the acronyms are too similiar. It leads to confusion when new members join the team.

If you use the following types of acronyms……..

  • cnt = count
  • amt = amount
  • catg = category
  • ins = insert

…….. you can expect them to cause some problems later on.

Recommendation: be descriptive and always use full words. A simple rule like this can help a lot.

  • count (i.e. max_use_count )
  • amount (i.e. voucher_amount )
  • category (i.e. hotel_category_id )
  • insert (i.e. insert_user_id )

5). Use audit columns

It is a good idea to include a set of audit columns in ALL tables. It needs to be agreed what audit columns are, and will be, necessary and these need to be added to each and every table WITHOUT EXCEPTION.

  • insert_dt – type DATETIME – time when the row was inserted (use NOW() at the time of insert)
  • insert_user_id – type INT (11) – a user (if logged in) who inserted the row
  • insert_process_code – type VARCHAR (255) – a process, function or class which inserted the row
  • update_dt – type TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP – automatically changed on any change to any column in the row
  • update_user_id – type INT (11) – a user (if logged in) who modified the row
  • update_process_code – type VARCHAR (255) – a process, function or class which inserted the row
  • deleted_flag – type TINYINT (4) NOT NULL DEFAULT 0 – use values 0 or 1 . This is sometimes better than using DELETE. There is tracebility and an audit trail and the data is recoverable if required. Of course any code using the table must be aware of the existence of this audit column.

6). Batch delete & updates

In a large database, DELETE can take a long time and cause table locks and transaction queuing.

DELETE FROM salary
WHERE to_date IS NOT NULL

To make this faster, you could batch your deletes into smaller chunks in order to avoid locking tables. Obviously this means that the delete needs to be run multiple times. You will need to run it in a loop cycle. Stored procedures are good for this.

DELETE FROM salary
WHERE to_date IS NOT NULL
LIMIT 10000

For example in PHP:

while (1) {
mysql_query("DELETE FROM salary WHERE to_date IS NOT NULL LIMIT 10000");
if (mysql_affected_rows() == 0) {
// done deleting
break;
}
// you can even pause for a few seconds
sleep(5);
}

7). Reference the owner of the object

Always add a table name before column name.

Classic scenario – you start with a simple query:

SELECT
id AS employee_id
, first_name
, last_name
FROM employee
WHERE 1=1
AND deleted_flag = 0
LIMIT 100

The above looks fine BUT if your query changes (which it might) – you decide to join another table which also has a column called id, then you will need to rewrite parts or all of the query …

Example: Someone comes along and asks you to include the contract_date for the employee. This can be found in another table called contract. So you change your query to look like this …

 SELECT id AS employee_id
, first_name
, last_name
, contract.sign_date
FROM employee
INNER JOIN contract ON 1=1
AND contract.employ_id = employee.id
WHERE 1=1 AND
deleted_flag = 0
LIMIT 100

This will fail because some ambiguity has been introduced. The database engine needs to understand which table owns the various columns involved in the query in order to produce a result. Both the employee table and the contract table contain columns called id and deleted_flag.  In the first row we need to be clear what sholud be queried and renamed to employee_id. And then further down in the query, we need to be clear which deleted_flag we are talking about.

So now the query needs to change like this …

 SELECT employee.id AS employee_id
, employee.first_name
, employee.last_name 
, contract.sign_date 
FROM employee 
INNER JOIN contract ON 1=1
AND contract.employ_id = employee.id
AND contract.deleted_flag = 0 
WHERE 1=1 
AND employee.deleted_flag = 0 
LIMIT 100

It is best to always add a table name before column name from the beginning. Even if you are only dealing with one table at the time. That way, you will avoid rewriting the query when things are added to it.


8). Table names always singular

Although we talk about a table, a table is actually a collection of rows and each row is a representation of a SINGLE item. And that is actually what we are looking for. A query returns single items.

The general rules around this are:

  1. All table names are singular – ALWAYS
  2. All table name aliases are singular – ALWAYS

9). WHERE 1=1 (and / or)

This is a big time saver. This allows you to add and remove search criteria as you want. You can change the order of the search as you want.

SELECT id AS employee_id
FROM employee
WHERE 1=1
AND employee.deleted_flag = 0
AND employee.birth_date >= '1960-01-01'
AND employee.birth_date <= '1960-31-12'
ORDER BY employee.birth_date
LIMIT 1000

And now you can remove some search criteria like below. Nice and easy 🙂

SELECT id AS employee_id 
FROM employee 
WHERE 1=1 
--AND employee.deleted_flag = 0 
AND employee.birth_date >= '1960-01-01' 
--AND employee.birth_date <= '1960-31-12' 
ORDER BY employee.birth_date 
LIMIT 1000

10). Old vs. new JOIN style

This is an old join type. Old syntax … Previously you added all the table names to the FROM clause and filtered the result in the WHERE clause. This can be quite heavy depending on the size of the tables.

SELECT employee.id
, employee.full_name
, contract.start_date
FROM employee, contract, lst_contract_tp
WHERE 1=1
AND employee.id = employee_contract_rel.employee_id
AND lst_contract_tp.id = contract.contract_tp_id
AND employee.deleted_flag = 0
AND contract.deleted_flag = 0

Newer syntax is better …

SELECT employee.id 
, employee.full_name 
, contract.start_date 
FROM employee

INNER contract ON 1=1
AND contract.deleted.flag = 0
AND employee.id = employee_contract_rel.employee_id

INNER lst_contract_tp ON 1=1
AND lst_contract_tp.id = contract.contract_tp_id 

WHERE 1=1 
AND employee.deleted_flag = 0

This still allows you to do the same amount of filtering inside the WHERE clause but the datasets returned for the other tables will be smaller because the search criteria for those is declared upfront inside the JOIN

Ideally, you should contain the conditions for the main table in the WHERE clause and the conditions for the other tables in their seperate JOIN clauses. This is not always possible but that guidline should be followed as much as possible. The query will be clearner, more efficeint, more readable and modifiable, better performance ……..


11). Prefix database objects

views with v_
functions with fc_

Also for procedures, it is good practise to prefix the name of the procedure with the primary purpose of the procedure. So if the procedure purpose is to INSERT, the name should be INSERT_procedure_name


12). Don’t use column rows in ORDER BY

The query below will ORDER BY column 1. There is nothing wrong with this syntactically ….

SELECT
employee.id AS employee_id
FROM employee
WHERE 1=1
AND employee.deleted_flag = 0
AND employee.birth_date >= '1960-01-01'
AND employee.birth_date <= '1960-31-12'
ORDER BY 1 -- means order by first column
LIMIT 1000

… but what happens if you add a new column into the SELECT?

SELECT employee.update_dt
, employee.id AS employee_id 
FROM employee 
WHERE 1=1 
AND employee.deleted_flag = 0 
AND employee.birth_date >= '1960-01-01' 
AND employee.birth_date <= '1960-31-12' 
ORDER BY 1 -- means order by first column 
LIMIT 1000

Well your result will now be ordered by employee.update_dt instead of the intended employee.id. Better to do this …

SELECT employee.update_dt
, employee.id AS employee_id 
FROM employee 
WHERE 1=1 
AND employee.deleted_flag = 0 
AND employee.birth_date >= '1960-01-01' 
AND employee.birth_date <= '1960-31-12' 
ORDER BY employee.id -- means order by employee id (ALWAYS) 
LIMIT 1000

13). Use LIMIT 1 as much as possible

The example below is a bit of PHP code which works but is very much overkill. The code below returns all rows where birth_date == $todayDate. Then it checks if the number of returned rows is greater than 0. It is not interested in the number of returned rows, only that more than 0 were returned. Overkill …..

$todayDate = ... // Define
$sql = "SELECT birth_date FROM employee WHERE birth_date = {$todayDate}";
$result = $connection->query($sql);
if ($result->num_rows > 0) {
echo "Yes, there's an employee with this birth date"
} else {
echo "Nobody is celebrating";
}

… use LIMIT 1 if you are only checking for the existence of something, 1 or more exist. This will save huge on performance and efficiency from a percentage perspective. The code below is the same as the code above except the database engine will stop searching when the first instance of birth_date == $todayDate is found. Much much better for this purpose.

$todayDate = ... // Define
 $sql = "SELECT birth_date FROM employee WHERE birth_date = {$todayDate} LIMIT 1";
 if ($result->num_rows > 0) {
 echo "Yes, there's an employee with this birth date"
 } else {
 echo "Nobody is celebrating";
 }

14). Consider data types to use, it makes a difference

Don’t just randomly choose data types. Don’t use varchar255 for everything ‘just to be safe’. Instead form a habit of considering how you can be more efficient, how can you save on storage space. Although storage is cheaper and getting cheaper all the time, that is no reason to waste it.

Integers take 11 bytes of space while varchars usually takes much more. If database size is an issue or of concern, it is worth considering how information is stored on a table level.

Take IP addresses for example. Lets use a varchar20 to store an IPv4 address as a string.  IP address = ‘145.54.123.90‘.

We can use the function INET_ATON(expr) https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_inet-aton to convert the string into an unsigned integer format. This means that instead of using 20 bytes of your precious storage, you use only 11 bytes.

SELECT INET_ATON('145.54.123.90')
=> 2436266842

You can convert the other way using INET_NTOA(expr) https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_inet-ntoa

SELECT INET_NTOA(2436266842)
=> '145.54.123.90'

The above is valid for IPv4 number. IPv6 addresses can be managed in the same way with the following functions:

In order to see the benefit of storing values as opposed to varchars,  you could have 2 identical tables; ip_address_varchar and ip_address_int, one with the original string values and the other with the converted unsigned integer values. Do the following the see the differences between them.

1. Make sure to analyze tables first

ANALYZE TABLE ip_address_varchar
ANALYZE TABLE ip_address_int

2. Verify that the count of rows in each table is the same

SELECT COUNT(*) FROM ip_address_varchar
SELECT COUNT(*) FROM ip_address_int

3. Check the size of tables on disk (in MB)

SELECT
 table_name,
 (data_length + index_length) / power(1024, 2) AS tablesize_mb
 FROM information_schema.tables
 WHERE 1=1
 AND table_name IN ('ip_address_varchar', 'ip_address_int')

Next: Indexes