dbt: A New "Species" in Data Industry

Deep dive on this $4.2 billion data transformation tool

Dec 19, 2022

Introduction

The rise of data clouds like Snowflake has recaptured the world's attention to the status of SQL as the key language in the data industry. Data tools, especially SQL, have re-gained traction, dbt being one of the beneficiaries.

As a data transformation tool, dbt compiles users' codes into SQL statements that can be executed in data warehouses. Previously, the transformation of data required more complex language and the majority of data analysts only knew how to use SQL, making it almost impossible to do their job without the help of data engineers. Today, the emergence of dbt and other data processing tools based on SQL enables data analysts to transform and analyze data on their own.

Being an important component of the new generation data stack, dbt is often used in combination with Snowflake, Fivetran, Looker and other similar products. dbt is easy to navigate and boasts an energetic community, which are two of the good qualities of good open-source products. In addition, without any direct competitors, dbt is also the only mainstream solution provider in its field.

Dbt is the rule setter for the new category in the data industry, giving rise to an emerging job -- analytics engineers. In 2021, the number of analytics engineer jobs requiring candidates to master dbt grew by 3 folds from the previous year. Today, over 25,000 people join the dbt community on Slack, 9,000 enterprise employees use dbt Core (open-source version), and 1,800 enterprises are paid users. In the past year alone, the number of paid users has doubled with the ARR increasing by 6 folds, testifying to the great value and influence of dbt.

However, when it comes to investment and enterprise development, there are still some issues that dbt needs to overcome: the glass ceiling is relatively low, and its profit-making products have yet to achieve "Product/Market Fit". The former issue is caused by the low average revenue per user (ARPU) of dbt, which is only 1/30 of that of Snowflake, while their users overlap to a large extent. For dbt to break the glass ceiling, we believe that it can attempt to adjust the pricing model and expand the product matrix, while strengthening the monetization capability of the teams.

Undoubtedly, dbt is a unique player in the modern data stack with great value. Also, people genuinely love dbt. Therefore, we believe that dbt is a company that is worth watching closely.

Table of Content

01 Problems dbt Solves

02 From ETL to ELT

03 dbt Business and Growth Potential

04 Why Does dbt Merit Our Attention

05 Conclusion and Suggestions

1. Problems dbt Solves

An enterprise's data comes from multiple sources, chief among which are database, documents, data sheets, and the Web. As data is scattered, data quality varies with inconsistent data definitions and business attributions, it is necessary to ETL (Extract, Transform, and Load) or ELT data so that it can be used for analysis, visualization, and modeling. dbt provides solutions for "T", or data transformation, in this process.

Data transformation refers to the process of transforming the data from one format/structure into another, which is an important part of ETL/ELT. Data transformation includes the following processes:

Data cleaning: it aims to solve the problem of data missing and inconsistency

Data standardization: formatting data

Duplicate data deletion: deleting redundant data

Inspection: the system reports low quality and abnormal data

Sorting: sorting data by category

These are often done by data engineers. When data is made accessible, data analysts start the data analysis process to gain valuable insights.

Generally, the amount of data one data engineer is able to process is enough to meet the demands of 2-3 data analysts. However, data analysts far outnumber data engineers, making it hard to cater to the needs of all data analysts. Even if there are enough data engineers, it does not make sense to allocate one data engineer to 2-3 data analysts.

Also, data engineers and analysts work in silos, leading to tremendous communication costs and low work efficiency. Data engineers' understanding of the business is not as sophisticated as data analysts', whilst data analysts are not familiar with the subtle differences between datasets. With different metric definitions and calculations, data analysts often repeat statements already written by other colleagues, slowing down the decision-making process and undermining the quality of the decisions.

Therefore, the optimal solution is to enable data analysts, instead of data engineers, to ETL/ELT data.

That said, data analysts cannot finish the ETL/ELT work with the tools currently available, because the ETL/ELT process requires complex programming language and most engineer analysts only get the hang of SQL. In this case, to empower data analysts to do their job, two solutions stand out: One is to motivate data analysts to learn to program, which is obviously not realistic; the other one is to make the ETL/ELT process possible via SQL, which is what dbt is doing.

dbt is an enabler for data analysts to transform data. With dbt, data analysts can easily navigate data transformation via SQL statements independently without any help from data engineers. The "E" and "L" parts of the "ETL/ELT" process can also be done via SQL, with Fivetran facilitating "E" and Airbyte facilitating "L". But this article will not delve into this.

2. From ETL to ELT

As a data transformation tool, dbt compiles users' codes into SQL statements that can be executed in data warehouses. Being an important component of the new generation data stack, dbt is often used in combination with Snowflake, Fivetran, Looker and other similar products.

Cloud data warehouses have become the central platform for data analysis, where the data processing method changes from ETL to ELT.

The traditional way to process data is "ETL", where data is first extracted for transformation before being loaded into data warehouses. Yet, as cloud data warehouses like Snowflake, Redshift, and BigQuery become increasingly important, they become the central platform for data analysis. The advent of the new generation data stack has given birth to a new way of data processing: data is first loaded into the cloud data warehouses before transformation, or ELT as we know it. This prompts the shift from ETL to ELT.

In the ELT process, tools like Fivetran are used to extract data from data sources. Then, the data is loaded into cloud data warehouses, where dbt is used to transform the data into formats needed for analysis. Compared with ETL, the ELT process is more flexible and convenient.

In addition, as data caculation and storage are separated in cloud warehouses and the performance and expandability of data warehouses are also on the rise, processing data in data warehouses becomes possible. dbt thrived, to a large extent, on the boom of cloud data warehouses. It is even fair to say that dbt is an enterprise emerging from the rise of cloud data warehouses.

The rise of cloud data warehouses like Snowflake has thrust SQL into the spotlight of the world once again.

SQL is the "mother tongue" of databases. Ever since the advent of Hadoop, people have transferred workloads from data warehouses to novel data lakes. The open-source Spark became the standard language of data lakes in 2010. The success of Hadoop and Spark has captured widespread attention, but it is worth noting that the majority of data transformation is done via HiveQL, a "spin-off product" of SQL, even in the heyday of Hadoop.

In recent years, the rise of cloud data warehouses using SQL like Snowflake has regained the world's attention to the status of SQL as the key language in the data industry.

Also, one major pain point of the data industry is that not all potential data consumers can serve themselves independently. The barrier of entry has limited the promotion and monitzation of data tools. With a low entry threshold, SQL can retrain users that have experimented with other languages. In the long run, the number of SQL users will be far higher than that of other languages. Therefore, putting SQL at the core means that more people can consume and use data independently, leading to a bigger market.

3. dbt Buisness and Growth Potential

dbt is comprised of two basic components: a compiler and a runner. Users write dbt codes in the text editor and call dbt from the command line. dbt compiles all codes into SQL and executes these codes against the configured data warehouses.

However, the goal of dbt is not to serve as a library of SQL transformations. Instead, it aims to provide users with powerful tools to build and share tranformations and models. For this to happen, dbt is equipped with a package manager, which is akin to a sharing platform. It allows users to publish repositories of dbt codes that can be directly referenced by others. dbt transforms data analysts from tool users to tool developers.

According to user interviews, in addition to "data transformation", dbt's package manager is also popular among users. One GitLab data analyst said, "With dbt, once the data transformation process is defined, the codes will be stored on dbt. You can modify and iterate them at any time. dbt also supports peer review."

"When using dbt, you feel like you are a software engineer. You can use the language you like to program and upload your codes to GitHub for peer review and feedback. You can also run and change your codes directly on dbt. We have not seen major flaws yet. Once we and others build a model or pipeline, we can use it in a repeated manner to avoid repetitive work. It provides a tool to define standards, which can be used by everyone repeatedly", according to one data analyst.

Besides, dbt provides data models, data testing and data documents, among other functions.

Data models: It refers to modular data storage. This allows users to express the program structure clearly so that the teams can better understand what users are doing. Also, it makes it easier to debug the codes and models made by you and others.

Data testing: It refers to helping teams test data to improve data quality. dbt evaluates data quality in the following four stages: source data, modeling, model deployment, and model & code walk-through. Data testing can also pinpoint the location of source data.

Data documents: The documents include documents recording "how ARR is defined" and the source data these metrics rely on. These documents help address common database-related questions and align the teams' understanding of the data for easy team collaboration.

Enterprises using dbt include Canva, GitLab, and HubSpot, with users mainly being data engineers, data analysts and data scientists. As a matter of fact, dbt also gives rise to a new job - analytics engineers, which we will touch on later.

From a monetization perspetive, dbt has three versions: for individuals, for teams, and for enterprises. Specifically, the version for individuals is free-of-charge for good, the team version charges users USD 50 per seat per month, and the enterprise version adopts a customized price in light of the number of seats, as well as additional functions and services needed.

Meanwhile, dbt Core is open-sourced, so engineers that intend to run dbt locally or on the enterprise's internal IT infrastructure can deploy it on their own.

In 2021, dbt achieved many milestones:

Community: The dbt community on Slack is home to more than 25,000 users. dbt has 12 Meetup groups in 8 countries.

Client: At present, a total of 9,000 employees use dbt Core (open-source version). 1,800 enterprises are paid users. In the past year alone, the number of paid users has doubled with the ARR increasing by 6 folds.

Partners: In 2021, the number of dbt's strategic partners increased by 2 folds.

New use cases: dbt Cloud API supports over 25 enterprise applications.

Team: The team scale is expanded from 50 people to 200, an increase of 4 times.

Weekly Active dbt Projects

All metrics testify that dbt Core has achieved Product/Market Fit, and is growing rapidly. Yet, the revenues and user feedback show that there is still some way for dbt's paid products to reach PMF.

4. Why Does dbt Merit Our Attention

01 dbt Is the Rule Setter for the New Category in the Data Industry

Some data analysts only know how to use SQL language, making it impossible to complete data analysis on their own, while data engineers do not have sophisticated business understanding and insights. This is where dbt comes in. dbt aims to bridge the skill gap between data analysts and data engineers. It gives birth to a new job that stands in the middle of the spectrum - analytics engineers.

The job of analytics engineers is to use SQL to build data pipelines and models, something that used to be done only by data engineers or data scientists.

When a job is defined by a tool, many will speculate whether the tool is genuinely needed in work or the tool is not actually needed and it is just that the company requires it to show its value. Data fetched from multiple recruitment websites show that the number of posts requiring analytics engineers to master dbt tripled in 2021 alone. This is a testament to the actual value and influence of dbt.

02 dbt Has What It Takes to be a Good Open-source Company

With friendly UI and little friction, dbt is easy to navigate and can solve problems effectively. The fact that it uses SQL, rather than other complicated language, to transform data makes it easy to use. Also, easy navigation is enabled by its friendly product design and interaction. The abundance of dbt teaching tutorials reduces the barrier to entry for beginners.

In addition to its strong product capability, dbt boasts a dynamic community, which is crucial for a bottom-up open-source product. dbt has over 25,000 active community members from 8 countries and regions on Slack with great participation. Their contributions and feedback make dbt better.

In fact, like other open-source communities, the early version of dbt was to provide a channel for early adopters to communicate and give feedback. As companies like HubSpot and Casper began to use dbt, their employees joined the Slack community. This has provided the dbt community with two benefits. First, big companies have a large number of employees. When these employees leave the company, they can introduce dbt to another company. Second, employees in big companies have good qualities and a broader vision. They lead the discussion from "how to use dbt" to discussions about the modern data stack, making the dbt community a well-known data forum. Some users even take the initiative to build a local dbt group. Today, dbt has influential community groups in places like Silicon Vally, New York, London, and Sydney.

dbt will hold the Coalesce conference every year. The online conference attracts new members to the community. Besides, dbt community members share articles and blogs about dbt on Substack or other channels. dbt also launches "dbt Learn", an activity that also provides expertise and services for existing users, to bring new users to the community.

Its strong community foundation has spared dbt any sales and marketing spending before it starts its financing. The company relies totally on organic growth. This provides a window into the superiority of dbt and the affection of users.

An easy-to-use product coupled with a dynamic community, these are the two qualities dbt shares with good open-source products.

03 A Founder Worth Betting on

As the founder of dbt, Tristan Handy is very outstanding. In early 2013, Tristan joined RJMetrics (later acquired by Magento) as the marketing VP, where he led a project called Stitch in 2015. Later, Stitch was separated as an independently operating business unit.

In 2016, Tristan left RJMetrics and founded Fishtown Analytics, which was originally an analytics consultancy aiming to help start-ups analyze their business. At that time, FA was operating an open-source product named dbt (standing for data building tool). The popularity of dbt has made it the core business of FA. FA officially changed its name to dbt Labs in June 2021.

Tristan Handy has a habit of writing blog articles. He often shares his thoughts and insights about dbt and the data industry on his blog. Tristan shared his ideas about the development trends of the modern data stack on his blog multiple times. He expounded on the thinking behind important decisions and strategies of dbt, including why dbt does not support Python, why it is at such a low price, and why dbt does not start with acquiring enterprise clients.

Tristan will recap the development of dbt and design future plans there. By interviewing the management of dbt, we learn that Tristan will make clear targets for each development stage, and it is impressive that almost all the targets are right on track.

Also, Tristan is very ambitious. He has high requirements and expectations for his own life and dbt. Investors and the dbt C-Suite familiar with Tristan said that Tristan spends every waking second, including eating, walking and travelling with the family, contemplating how to make dbt a great company.

After dbt finished its C Series funding, Tristan said, "I am thinking about how I spend my time on this planet and what I should put on my gravestone? If my answer to one of these questions is "make tons of money", then you might be reading an article about how Fishtown Analytics secured a good buy-out offer. However, I cannot image myself making such a decision. The reasons for this round of financing are the same as dbt not being acquired -- to remain independent."

When asked what kind of company dbt will become, Tristan said, "I want to create a smithy." Users can buy and use whatever data processing tools they want. Obviously, the future Tristan envisions is not represented by a small tool we see today.

04 Virtuous Competitive Environment and Absolute Competitive Advantages

In addition to dbt, the ELT tools available on the market include Airflow, Fivetran, AWS Glue, Dataform, Matillion, and Panoply. But in fact, if we focus on data transformation, it is fair to say that dbt has no direct competitor and is the top (and only) choice for clients.

But two kinds of companies could become its future competitors. One is companies like Fivetran and Airbyte, which are responsible for the "E" and "L" parts of the ELT process and are likely to cover the whole "ELT" process going forward. The other is Snowflake, Databricks and the data warehouse platforms of three major cloud service providers.

The market demands for dbt are largely based on the boom of data warehouses. It is reasonable to bill dbt as the enterprise that rises up from data warehouses. In this case, data warehouse platforms can develop their own ELT tools as substitutions of external tools, or choose to work with other ELT companies. These are the competition and challenges facing dbt in the future. But currently, there is no data transformation tool that manages to pose a threat to dbt. Therefore, dbt enjoys a good competitive environment with absolute competitive advantages.

5. Conclusion and Suggestions

Factors driving up the growth of dbt include the maturity of cloud data warehouses, the thriving development of SaaS and data ecosystems, and data transformation and migration needed for making them accessible on multiple cloud servers. Among them, the maturity of cloud data warehouses is the key. The most common combination is Fivetran + dbt + Snowflake. In fact, dbt can be considered as an enterpise in the Snowflake ecosystem. In this case, timing will not be the key issue. The ecosystem is ready. The Snowflake ecosystem will be one of the best data infrastructure ecosystems in the next 5-10 years.

The key lies in how big the market is and how robust the product demands are. As per user interviews, we learn that the major users of dbt are the same as those of Snowflake. However, since dbt offers a free open-source version, the number of paid users is way lower than that of Snowflake. Plus, one of our major concerns is the ARPU of dbt's paid users is relatively low and we have yet to see any possibility of growth. This has something to do with the low price and the unscientific pricing model of dbt. Several users mentioned that they spend about USD 10,000 on dbt each year, while some only pay several thousand dollars annually, registering almost the lowest spending in the whole data stack if you think about the USD 100,000-600,000 they lavish on Snowflake.

On top of that, dbt is not a must-have product. When users are asked "if the budget is slashed, which product in the data stack will you choose to give up", dbt often comes up at the top or middle of the list, while Snowflake is the last product they want to abandon.

Hence, we can work out dbt's potential revenues and its future market value by benchmarking it against the market scale of cloud data warehouses. According to Maximize Market Research, the market scale of cloud data warehouses was 26 billion dollars in 2021. It is expected that it could reach 47.8 billion dollars by 2027. There is a difference of 30 times in ARPU between dbt and Snowflake. So even if dbt is the only player in the field, the room for growth is limited.

With good market expectations for Snowflake's ecosystem, the market valuation/cap of dbt can enjoy high premiums. So we can also compare the market cap of dbt with that of Snowflake. When Snowflake went public, its market cap broke the 100-billion-dollar mark at its peak. After market adjustment, it was still worth over 40 billion dollars. As the users of dbt and Snowflake overlap greatly, given the 30-time difference in ARPU, dbt's future market cap is projected to be only be 3-5 billion dollars. Since users' needs for dbt are not as strong as Snowflake, the number of dbt's paid users is much lower than that of Snowflake. So it could be lower than 3-5 billion dollars. While dbt is estimated to be valued at 4.2 billion dollars in the Series D funding round, we believe that this valuation is actually higher than what it is actually worth.

That said, the penetrate rate of dbt and cloud data warehouses is still very low. Growth has never stopped. Several billion might not be the cap. Apart from being widely used on Snowflake, dbt is making its way to other cloud data warehouses (like RedShift and BigQuery) and DataLake/LakeHouse (Databricks). In terms of its products and teams, dbt does have a killer product, which hasn't found any direct competitors.

Overall, given the market maturity and superiority of products and teams, dbt is a safe bet for investment. However, as the market is still small and relies on the ecosystem of cloud data warehouses, dbt is still unlikely to become a platform itself, so it can only be seen as a 6-point subject, not a 9- or 10-point one. dbt and cloud data warehouses are at the stage of rapid growth. From the perspectives of increment and market expectations, it is still worth investing in. But remember to keep track of the growth of cloud data warehouses like Snowflake. The future scale of dbt, to a large extent, rests on the growth potential of the cloud data warehouse ecosystem.

For dbt to garner more revenues and increase the glass ceiling, we believe it can try the following three things:

Verify whether its monetization products have achieved PMF so as to enhance its profit-making capability.

Adjust the pricing model. dbt adopts the per-seat pricing model. This has imposed some restrictions on dbt's future revenues. As data volume growth outpaces user growth greatly, it might be a better choice if data infrastructure products use the usage-based pricing model in the long run.

Once the monetization products finish the PMF verification, dbt can march into the "E" and "L" parts of the "ELT" process. Going forward, in light of its product relevance, dbt can also dabble in other fields of the cloud data warehouse eco-system.

Etna Research

Discussion about this post