A case for natural primary keys

A case for natural primary keys

Surrogate keys are a popular method for primary keys in a database. However, there is often a better way.

Background

In all relational databases, there must be a way to uniquely identify every row of data. This job falls to the primary key, one column (or set of columns) per table that uniquely represents that row. You cannot have two rows with the same primary key. These keys are not only used by the database to organize and lookup data for the table, but are usually used as foreign keys, that is a reference from one table to another.

Using randomly assigned ID’s is a popular method for creating primary keys on a database. These can take many forms, such as auto-incrementing integers or randomly generated UUID’s. These keys are then inserted into the table (or generated by it) for every row, making it uniquely identifiable. These are also called Surrogate keys and there are several reasons that are stated for using them:

  • If using UUID’s, you can know your primary key before you insert the row, saving a round-trip to fetch it. This is especially useful for many ORM’s that need the primary key value to work properly.
  • It is easy. Surrogate keys just happen automagically, meaning you don’t have to think about uniqueness of your data.
  • Both methods guarantee every row is unique, you don’t have to worry about duplicating data.

Alternatively, a Natural key is one that can be derived from the data. It may be one column of your existing data that is always unique (such as stock ticker or social security number), a combination of columns known as a composite key (such as name and address), or a hash of several columns into a new column. This means your data is naturally providing a unique key, hence the name.

There are many advantages of natural keys over surrogate keys, mainly:

  • If using composite keys, there is no extra storage used.
  • Natural keys can be computed, therefore there is no round-trip fetch to get them.
  • They force data to be unique in the database, no accidental duplication of data.
  • They can be calculated again and again, meaning rebuilding your data doesn’t cause them to be lost.
  • When build uploading data, foreign keys to other tables can be calculated, meaning you don’t have to look them up at the time you insert new data

Natural Key Advantages

Many database designers seem to shun natural keys nowadays. This probably has something to do with the wide use of ORM systems that generally prefer to use auto-incrementing ID’s by default.

However, the use of natural keys can provide some major benefits.

Natural keys can be computed

When a primary key is computed before the row is inserted to the table, there is no round-trip fetch necessary to get the primary key back. This round-trip fetch is necessary for ORM systems to properly represent the row, for setting up foreign key relationships for example. Leaving the key assignment to the database means always fetching the value back, which is a waste of a network call.

Prevent data duplication

While this could go either way, many types of data shouldn’t be duplicated in a database. You wouldn’t want two copies of a customer or two copies of a product in your database. With surrogate keys, there is no prevention of data duplication. Of course, one could use UNIQUE constraints to prevent data duplication, but that is an extra step. The constraints could be altered or lost over time and that would lead to duplication.

Using the data with a natrual key bakes the uniqueness directly in to each row, there is no need for an extra step.

Repeatability

As natural keys are built from the data, they will always be the same. If a data table needs to be deleted or rebuilt, the keys will remain the same. With UUID’s or auto-incrementing integers, the keys would be randomly assigned if the table is rebuilt.

While primary keys shouldn’t necessarily be used as ID’s outside the context of the database, they oftentimes get used that way. With changing primary keys after a rebuild, any external system using them would be broken. With computed keys, this is not a problem.

Calculate foreign keys

When bulk inserting a lot of data that has foreign keys, you need a way to get those keys. Let’s look at an example. Say you have a table that holds product data, such as name, model number, description, etc. Then you have another table that holds manufacturer information such as name, address, etc. Each product has a manufacturer and you link the two tables with a foreign key.

If you try upserting large amounts of data for different products and manufacturers, you would have to check if the corresponding manufacturer exists in the database, and create it if not. Then you would have to look up the corresponding manufacturer when you go to insert the products so you can get the proper foreign key. This requires many queries to the database to fetch the manufacturers.

Alternatively, if you can calculate the primary key for the manufacturer, then you can simply build all the product rows and insert them without having to query the manufacturing table at all. This makes inserting data much easier and faster.

Natural Key Disadvantages

There are, of course, some drawbacks to natural keys. Mainly,

  • Your data needs to have an inherent uniqueness, which isn’t always the case.
  • If you need to change the data, you must change the key as it is derived from the data.

Summary

While choosing a primary key should always be done in the context of your data, using surrogate keys seems to be overused in many places. This is probably due to the ease of use out of the box, especially when using ORM systems. However, natural keys can provide a lot of benefits and should be considered. Being able to derive the key directly from your data usually means less storage, less roud-trips to your database, and more consistency over time.