Designing Data Warehouses for Better Understanding
For humans and LLMs alike
Foreword
Earlier this year I started working on an open source data handbook with instructions to help early stage companies set up data infrastructure and build data teams.
This post introduces one of the core sections of the handbook called “Good Conventions” - which is aimed at helping you design data models and warehouses such that they are easy for humans and large language models to understand.
You can find the ultimate data handbook online here and also on Github if you’d like to clone, edit and host it yourself.
Introduction:
Designing a data warehouse that is easy to understand for both humans and large language models (LLMs) is crucial. A well-designed data warehouse can significantly improve data discoverability, maintainability, and overall efficiency.
Brief
In this blog post, we will discuss three essential aspects of data warehouse design in brief:
Schema separation
Table nomenclature
Column nomenclature.
You’ll find more detailed instructions and examples on the ultimate data handbook.
Schema Separation
Separating your data into distinct layers is the foundation of a well-designed data warehouse. It allows for a clear separation of concerns between the different stages of your ETL (Extract, Transform, Load) process. As a general principle, you should have at least the following layers in your data warehouse:
Source:
Contains raw data or staging data with minimal cleaning and homogenization.Intermediate:
Provides a space to store intermediate results of your ETL process without crowding the user-facing layers.Core:
Stores the facts and dimensions that are the foundation of your data warehouse, mapping directly to business concepts.Marts:
Denormalised views optimised for specific use cases, built on top of the core layer.
By separating your data into these layers, you can control data access, improve discoverability, and make your data warehouse more maintainable.
Table Names
Consistent and descriptive table names are essential for understanding the content of a model without having to dive into the details. Table names should follow these basic conventions:
Use a prefix to indicate the type of model (e.g., stg_ for staging, fct_ for facts, dim_ for dimensions).
Include a descriptive name that clearly conveys the content of the model.
Use lowercase, snake_case, and plural names.
Be concise but not ambiguous.
For example, a staging table for Salesforce opportunities could be named stg_salesforce_opportunities, while a fact table for monthly customer revenue could be named fct_customer_revenue_monthly.
Column Names
Column names should be descriptive and follow a consistent naming convention. Some key principles to follow include:
Optimise for clarity, not brevity.
Use lowercase, snake_case names.
Avoid reserved words and limit domain-specific abbreviations.
Use consistent and descriptive identifiers.
Additionally, specific data types should follow certain naming conventions:
Boolean columns should be prefixed with is_, was_, has_, or had_.
Date columns should be suffixed with _date, _month, or _year, depending on granularity.
DateTime columns should be suffixed with _at and use UTC timezone by default.
Foreign key columns should be named after the referenced table and suffixed with _id.
Aggregate columns should take the form of aggregation_object_context_units.
Conclusion
Designing a data warehouse with clear schema separation, consistent table nomenclature, and descriptive column names is crucial for making your data easily understandable and maintainable.
By following the guidelines outlined in this blog post and referring to our detailed data handbook for more information, you can create a data warehouse that is both human- and LLM-friendly.


