Keys - Primary, Surrogate, Natural and Foreign

Chris Barber
Oct 26
4 min read

Keys are a core component of star schema modelling; relationships between tables are built using the keys. This article covers:

The main key types
Star Schema diagrams
Best practices when using Keys

An understanding of keys become increasingly important with more complex solutions. Not only do you need to understand them from a modelling perspective, but a common vernacular is required to communicate with team members.

The 4 Main Key Types

The hospital star schema will be use to explain our 4 main keys types:

Hospital Star Schema

Primary Keys

Primary keys uniquely identify each row within a table. For instance, in the hospital example each of the dimension tables has a primary key; "Date", "Hospital ID" and "Department ID". With regards to fact tables, the general rule is to avoid primary keys. This is because, in an analytical environment, we are generally interested in "summing up" or "aggregating" multiple measures from our facts and explaining them in different ways rather than focusing on an individual line items within the fact.

Surrogate Keys

Surrogate keys are a type of primary key initiated by the database (i.e., not from the source system) to uniquely identify a row. It is considered "best practice" to create unique keys in the database rather than relying on source systems. One reason for this is because BI systems compare data across multiple source systems; using the database to create the key ensures that a common key is used. Therefore, the keys in the dimension table ("Date", "Hospital ID" and "Department ID") would typically be more specifically referred to a surrogate keys rather than primary keys.

Foreign Keys

Foreign keys are the representation of a primary key in another table. In our example, the "Date", "Hospital ID", "Department ID" as expressed in the fact table are foreign keys. Notice how the foreign keys are duplicated in the fact table, but surrogate keys are not in the dimension table; this is central to the star schema design. In the majority of instance, fact tables are made up of only the foreign keys and the associated measures, as is in the example.

Natural Keys

Natural keys come from the source system. For instance, we may have a hospital natural key called "Hospital Code". As it comes from source systems, this usually has some business meaning unlike the "Hospital ID" (a surrogate key) and users may frequently refer to hospitals by their "code" (natural key). The same applies in many other scenarios:

Employee may be referred to by "Employee Code" (natural key from HR system)
Product may be referred to by "Product Code" (natural key from ERP system)
Project may be referred to by "Project Code" (natural key from project management system)

All the above happen in practice, so we often keep the "Natural Key" in our dimension to give users the flexibility to query their data by this. It is also common practice to create an attribute in the dimension which combines the natural key and its subject. For example, a computer company may have an attribute combing the Product Key (1074) and Product Name (Keyboard) into a single attribute (1074 Keyboard).

Star Schema Diagram

Typically, we don't use the Star Schema model as has been shown so far. Instead, we use a schema using tools such as Diagrams.Net:

Hospital Schema

In this diagram:

SK - refers to surrogate key, sometimes the less specific PK for primary key is used
FK - refers to the associated foreign key
AT - refers to all attributes
ME - refers to all measures

There are often multiple ways to model a problem. Thus multiple schema designs are often created showing the problem to be modelled in multiple ways; from this a conversation can be had around which is the most applicable solution in a given scenario.

Best practices when using keys

Data Type

Power BI allows you to store your data according to different data types, for instance as a Boolean value, an integer or a string. The best practice for keys is to store as a integer. This is because:

An integer is more efficient from a storage perspective
Reduces the likelihood of error with joins. For instance, a string of "SK_1" and "SK 1" would not join because of the missing "_"

The exception to this rule is DateTime data type; in this, the date part gets stored as an integer and the time type gets stored as a decimal. It is, usually, best practice to keep DateTime without a time component (i.e., 09/03/2022 00:00:00) and manage time separately.

Separating Date and Time

The Power BI engine is columnar and performs best with dimensions of lower cardinality; the number of unique rows and records. By combining data and time into a single dimension, the cardinality drastically increases. If we have every second for 1 years worth of data in:

A single DateTime dimension, the surrogate key would have a cardinality of 31.536m records (365 days times 86,400 seconds per day)
Separate Date and Time dimensions, the surrogate keys would have a cardinality of 365 records and 86,400 records respectively

For this reason, splitting out date and time - where the time surrogate key is stored as an integer - is considered best practice.

Naming Convention

As a rule, surrogate keys and their associated foreign keys should have the same name. This helps with the understandability and maintainability; it becomes difficult to understand a model when keys named differently. The exception to this rule is Role Playing Dimensions; where there are multiple relationships between 2 tables. This exception is covered in a later article.

Hide Keys

An overriding objective of data modelling is to create models which are understandable. Therefore, anything that does not have business meaning should be hidden from the end user. Any keys which do not provide value to the end user, should be hidden in the model; hiding retain the relationship, but when the user interacts with the model (i.e., through Q&A) the relationships do not show.

Conclusion

This article has outlined keys and their importance in the Star Schema design. If you are working on your own, you may not need to have a common terminology but you need to at least understand the concepts. However, as soon as you progress beyond working in isolation (or even have to pass the model to someone else to maintain) a common vernacular is required.