Our data engineers have built a lot of data warehouses that run on Amazon Redshift. Redshift can be a great platform when used efficiently, but like many others, it’s difficult to get it to work well. We’re now at the stage where we’ve collected enough knowledge about what works and what doesn’t that it’s worth pulling our best tips together into this post.
Optimizing your schema
Some data engineers start out by dumping data into Redshift with little thought to the schema, hoping the columnar database engine will magically work out the best way to query the data. We find that this makes it very difficult to run predictive data analytics, and your data scientists are likely to spend more time waiting for queries to run than creating new models. If you don’t want to start a battle between your data scientists and your data engineers, we recommend implementing an efficient schema based on a standard data warehouse star schema. Although Redshift doesn’t have indexes in the way that traditional databases do, you need to pay considerable attention to distribution keys and sort keys.
- Distribution keys – The distribution keys define how data is distributed amongst the different nodes.
- Sort keys – The sort keys are the closest equivalent to a traditional database index.
- Sort key styles – We recommend starting with COMPOUND sort keys, as INTERLEAVED sort keys are usually less efficient.
- Foreign keys – Although Amazon Redshift does not enforce foreign key constraints, they are used by the query optimizer, and as such we always include them in our table definitions.
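The points above can be sketched in a table definition. This is a minimal example, not a definitive schema: the table and column names are hypothetical, and the right distribution key depends on your own join patterns.

```sql
-- Hypothetical fact table for a star schema.
-- DISTKEY(customer_id) places rows with the same customer on the same
-- node, so joins to a dimension distributed on the same key avoid
-- redistributing data across the cluster.
CREATE TABLE fact_sales (
    sale_id      BIGINT        NOT NULL,
    customer_id  INT           NOT NULL,
    product_id   INT           NOT NULL,
    sale_date    DATE          NOT NULL,
    amount       DECIMAL(12,2),
    -- Not enforced by Redshift, but used by the query planner:
    FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id),
    FOREIGN KEY (product_id)  REFERENCES dim_product (product_id)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (sale_date, customer_id);
```

Putting the most frequently filtered column (here, `sale_date`) first in the COMPOUND sort key lets Redshift skip whole blocks when queries restrict on it.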
Managing your disk space
When running predictive analytics in the data warehouse, your Amazon Redshift queries can be very disk intensive, so good data engineers will also ensure that there is always plenty of disk-space headroom. Although it’s easy to add new nodes to the data warehouse, this can become very expensive. The more data Amazon Redshift needs to read from disk, the longer any query is going to take. If all of your tables are sorted correctly but some of them are still large, you can look at optimizing the compression encoding.
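One way to find the candidates is Redshift’s own system views. A sketch, assuming you have access to the system tables (`fact_sales` is a hypothetical table name):

```sql
-- Largest tables and how unsorted they are. SVV_TABLE_INFO reports
-- size in 1 MB blocks and the percentage of unsorted rows.
SELECT "table",
       size      AS size_mb,
       tbl_rows,
       unsorted  AS pct_unsorted
FROM svv_table_info
ORDER BY size DESC
LIMIT 10;

-- Ask Redshift to suggest compression encodings for a large table.
ANALYZE COMPRESSION fact_sales;
```

A high `pct_unsorted` value suggests running VACUUM before worrying about encodings, since sorted blocks are what let Redshift skip reads in the first place.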
Optimizing slow queries
There are always going to be users who work out new and unusual ways to run slow queries on your data warehouse. As a general guide, if a query needs to access a lot of data (e.g., reading every record from a historical table) then it’s going to be slow. Anything that doesn’t require a lot of data should execute quickly if it’s written correctly. Note that Amazon Redshift only stores the first 4,000 characters of each query text in its main query log, so you may need to contact the users directly to get the full query text.
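Before contacting anyone, it’s worth trying the system tables first. A sketch of finding long-running queries and reassembling a full query text (the 5-minute threshold and query id 12345 are placeholders):

```sql
-- Recent queries that ran longer than 5 minutes.
-- STL_QUERY.querytxt holds only the first 4,000 characters.
SELECT query,
       userid,
       starttime,
       DATEDIFF(seconds, starttime, endtime) AS duration_s,
       TRIM(querytxt)                        AS query_start
FROM stl_query
WHERE DATEDIFF(seconds, starttime, endtime) > 300
ORDER BY starttime DESC;

-- Reassemble the full SQL for one query from the 200-character
-- chunks stored in STL_QUERYTEXT.
SELECT LISTAGG(text) WITHIN GROUP (ORDER BY sequence) AS full_sql
FROM stl_querytext
WHERE query = 12345;
```

Note that the STL tables only retain a few days of history, so capture anything interesting promptly.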