
Granularity

In this article, we discuss another core concept: granularity. In doing so, we'll cover:

  1. Definition of Granularity

  2. Why you should (almost always) work at the atomic level

  3. Why more Foreign Keys results in more rows

  4. Sparsity in the fact table

Definition of Granularity


Granularity relates to the level of detail in the fact table. It can be explained in:

  1. Business terms, i.e., Sales is explainable by Date, Product, Department, Geography, Customer, General Ledger (GL) and Order Status

  2. Technical terms, i.e., Sales fact table has foreign keys of Date, ProductID, DepartmentID, GeographyID, CustomerID, GLID and StatusID

In these definitions, you would also have to include any Degenerate Dimensions present in the model, i.e., if "Invoice Number" were a degenerate dimension, it would form part of the grain and make it finer (more detailed).
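The grain can be checked mechanically: it is the set of keys (including any degenerate dimension) whose combination uniquely identifies each fact row. A minimal Python sketch, with made-up column names and values purely for illustration:

```python
# Hypothetical sales fact rows; column names and values are illustrative only.
fact_sales = [
    {"DateKey": 20220201, "ProductID": 1, "CustomerID": 10, "InvoiceNumber": "INV-001", "Amount": 2.00},
    {"DateKey": 20220201, "ProductID": 1, "CustomerID": 11, "InvoiceNumber": "INV-002", "Amount": 2.00},
    {"DateKey": 20220202, "ProductID": 2, "CustomerID": 10, "InvoiceNumber": "INV-003", "Amount": 4.00},
]

# The grain: the foreign keys plus the degenerate dimension (InvoiceNumber)
# whose combination should uniquely identify a fact row.
grain = ("DateKey", "ProductID", "CustomerID", "InvoiceNumber")

def grain_is_unique(rows, keys):
    """True if no two rows share the same combination of key values."""
    combos = [tuple(row[k] for k in keys) for row in rows]
    return len(combos) == len(set(combos))

print(grain_is_unique(fact_sales, grain))         # the declared grain holds
print(grain_is_unique(fact_sales, ("DateKey",)))  # DateKey alone is too coarse
```

If a supposed grain returns duplicates, the fact table is really at a finer grain than declared.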


Typically, business users won't want to discuss the model in terms of Foreign Keys (FKs), whereas a more technical audience will generally prefer the technical definition.

Why you should work at the atomic level


Kimball and Ross, in The Data Warehouse Toolkit, 3rd edition, state that:

You should strive to store the low-level measurement data resulting from a business process in a single dimensional model

That is to say, if the process you're interested in is stored at the granularity of 8 Foreign Keys you should bring all 8 Foreign Keys into your model, even if the users only express interest in 6 of those Foreign Keys at the time of requirements gathering. There are several reasons for this, the most important of which are:


Scalability


The main reason for going to this "Atomic" level of grain is that it gives maximum analytical flexibility. This maximum flexibility helps towards our Non-Functional Requirement (NFR) of making our project scalable. Again, Kimball and Ross, in The Data Warehouse Toolkit, 3rd edition, state that:

Atomic data should be the foundation for every fact table design to withstand business users' ad hoc attacks in which they pose unexpected queries

That is to say, your model will be flexible enough to support all the ways in which it is possible to interrogate the data. You can always work up from atomic detail to aggregated numbers, but you cannot work back from aggregated numbers to the detail.
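This one-way relationship is easy to demonstrate. A small Python sketch, with illustrative values: rolling up from the atomic grain is a simple group-and-sum, but the reverse step has no answer.

```python
from collections import defaultdict

# Atomic-grain fact: (Date, Department, Amount). Values are made up.
atomic = [
    ("2022-02-01", "Hardware", 100.0),
    ("2022-02-01", "Software", 50.0),
    ("2022-02-02", "Hardware", 75.0),
]

# Rolling UP is always possible: drop the Department key and sum.
by_date = defaultdict(float)
for date, dept, amount in atomic:
    by_date[date] += amount

print(dict(by_date))  # {'2022-02-01': 150.0, '2022-02-02': 75.0}

# The reverse is impossible: from the 150.0 total on 2022-02-01 alone,
# there is no way to recover the 100.0 / 50.0 departmental split.
```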


Reliability and Maintainability


Rather than going to the atomic grain, users are sometimes tempted to produce a number of smaller models, one for each specific need. For instance, suppose one department needed a certain set of dimensions and another department analysed the same process but through a different set of dimensions; the temptation is to create 2 smaller models. However, this harms our NFRs of Maintainability and Reliability; we now have to maintain 2 models, and they could fall out of sync with one another.


Compression could actually be worse if we use aggregated tables


If we aggregate, the compression of our table could actually be worse than in the non-aggregated form. One reason is that removing a foreign key leaves us with fewer rows, but those rows can contain more unique values (higher cardinality). For instance, say we have 100 records in our fact and every value is £2.00; that column has a cardinality of 1. If we remove a foreign key, we end up with fewer rows, but those rows won't all contain the value £2.00. Power BI uses column-based storage, so the cardinality of a column is a key consideration in how well the data compresses. There are other factors too, such as the encoding type of the data changing and making storage less efficient. These are advanced topics, but for simplicity: don't assume aggregation will necessarily improve the compression and size of the model.


Aggregations

Aggregations allow you to have a fact table at the atomic grain alongside additional, aggregated fact tables. This gives you the flexibility of the atomic grain, plus the performance benefits you may get from smaller, narrower tables.


Large Volumes of Data

Atomic-level data can grow to tens of billions of rows. Power BI (if the model is built well) can handle tens of billions of rows. You may be operating above these levels of data, but such cases are rare.

More Foreign Keys results in more rows


As your fact table grows in columns (Foreign Keys), it also grows in terms of the number of rows. Below is a very simple fact table with a single dimension:


Star Schema: One Dimension


If we now add an additional foreign key to our fact table, notice how not only do we have an additional column in our Fact, but we also have additional rows:


Star Schema: Two Dimensions

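The multiplication of rows can be sketched directly: each added foreign key multiplies the number of possible key combinations. A minimal Python illustration using the same 2-dates, 2-employees setup:

```python
from itertools import product

# The two dimensions from the example above (values are illustrative).
dates = ["2022-02-01", "2022-02-02"]
employees = ["Fiona", "James"]

# With one dimension (Date), the fact can have at most 2 rows.
# Adding the Employee key multiplies the possible combinations:
possible_rows = list(product(dates, employees))
print(len(possible_rows))  # 4 = 2 dates x 2 employees
```

Note this is the *maximum*: as the next section explains, real fact tables rarely contain every combination.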

Sparsity in the Fact Table


In the example shown, the fact has doubled in size: 2 dates x 2 employees results in 4 rows of data. However, this is not usually the case; the grain of the fact is not the result of all possible Foreign Key combinations. Instead, it is usually the result of all the Foreign Key combinations where there is an associated measure. For instance, had all values on 1st Feb 2022 been against Fiona, the number of rows would have been 3:


Star Schema: Two Dimensions Sparsity


Notice how there are no values assigned to James in the fact table for 1st Feb 2022; therefore no row is required in the fact table. We can infer that the value for James is zero on 1st Feb 2022 without storing a row. This behaviour is called sparsity in our fact table.
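Sparsity can be sketched as a lookup that falls back to zero when no row exists. The amounts below are invented, matching the shape of the 3-row example:

```python
# Sparse fact: only (Date, Employee) combinations with an actual measure
# are stored. Amounts are illustrative only.
fact = {
    ("2022-02-01", "Fiona"): 120.0,
    ("2022-02-02", "Fiona"): 80.0,
    ("2022-02-02", "James"): 60.0,
}

def sales_for(date, employee):
    """No row for this combination means the measure is inferred as zero."""
    return fact.get((date, employee), 0.0)

print(sales_for("2022-02-01", "James"))  # 0.0 inferred, no row needed
print(len(fact))                         # 3 stored rows, not the full 4
```

This is exactly what a reporting tool does when it cross-joins the dimensions for display: the missing combination shows as blank or zero without ever occupying storage.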

Conclusion

In this article, we've introduced the concept of granularity and outlined why we should almost always aim to work at the atomic level.



© StarSchema.co.uk 2025. All rights reserved.

  • LinkedIn Chris Barber
  • Twitter Chris Barber