We all are aware of the benefits of a data warehouse—but how do we get into picking one? These are a few that can help you out.
Data types:
There are three types of data that you might want to store for your business: structured, unstructured, and semi-structured. Most data warehouses support structured and semi-structured data management, but unstructured data is a better fit for data lakes.
Structured data is quantifiable data that can be organized neatly into rows and columns (e.g., sales records or customer contacts).
Unstructured data is data that can’t be easily managed and analyzed. Think of writing content (like blog posts or answers to open-ended survey questions), images, videos, audio files, and PDFs. If you’re looking to store purely unstructured data, you should consider a data lake instead of a data warehouse.
Semi-structured data is a mix of structured and unstructured data. Take an email, for example. The content of that email is unstructured, but there are quantifiable aspects to the email, such as who sent it, when they sent it, when it was opened, etc. Similarly, an image itself is unstructured, but you also often have access to structured data like the time the photo was taken, device type, photo size, geotags, etc.
If semi-structured data is important to you, BigQuery and Snowflake are two data warehouses known for having the best infrastructure to support storage and queries for semi-structured data.
Scaling for data storage:
Most data warehouses typically allow you to store massive amounts of data without much overhead cost. You probably won’t need more than what they offer, especially if analytics is the primary use case.
However, you’ll want to consider how a particular warehouse scales data storage during times of demand. For example, Amazon Redshift will require you to manually add more nodes (the basic structures in data warehousing that store data and execute queries) when you need more storage and computing power. On the other hand, Snowflake offers an auto-scale function that adds and removes clusters of nodes dynamically as needed.
Scaling for performance:
The performance of a data warehouse refers to how fast your queries can run and how you maintain that speed in times of high demand. As you can imagine, scaling for performance and data storage are closely connected. Like storage, performance will increase as you scale up the nodes in your warehouse.
These days, speed is a non-issue. Every warehouse is about as fast as the others. What you really want to consider in regard to performance is how much control you want over your speed.
Similar to how a data warehouse’s storage scales, you can add and remove nodes for faster queries. For some warehouses, like Redshift, you need to do that manually, but you’ll be able to tune it as precisely as you like. For others, like Snowflake, it will happen automatically for a hands-off experience.
Maintenance:
You likely want your engineers focused on building and maintaining your products instead of worrying about ETL pipelines and the day-to-day management of your warehouse—especially if you have a small team. In that case, you’ll want a data warehouse that is self-optimizing like BigQuery, Snowflake, or IBM Db2.
However, by maintaining your warehouse manually, experienced data warehouse architects can have greater control and flexibility to optimize it precisely for your company’s needs. If you want this level of control over your warehouse’s performance and cost, Redshift and PostgreSQL are your best options.
Ecosystem:
Consider using a data warehouse that is within the ecosystem of the applications you already use. For example, Azure Synapse Analytics is in the ecosystem of Microsoft products, Redshift within AWS, and BigQuery within the Google Cloud ecosystem. This will simplify implementation since you already have an infrastructure in place.
Otherwise, you’ll need your engineers to develop multiple custom ETL pipelines to get your data where it needs to be. You may still need to write a custom ETL to get data into your warehouse from certain data sources, but the goal is to minimize that work.
Cost:
Many factors go into data warehouse pricing, including storage, warehouse size, run time, and queries. For Redshift, you pay per hour based on nodes or per bytes scanned. BigQuery, on the other hand, has both a flat-rate model and a per-query model. Snowflake, IBM Db2, and Azure are all based on storage and compute time.
Ultimately, you want to choose the data warehouse that will do what you need it to do, not just choose the cheapest option.
For companies with a limited budget, PostgreSQL is a great free option and still has plenty of features. When you’re ready to upgrade, it’s easy to switch data warehouses, especially if you’re using a customer data platform like Segment that can communicate between the two warehouses seamlessly.