Relational databases have been around for a long time and they have been the primary way to store data in a consistent way. Therefore, there is a lot of information surrounding them. NoSQL databases on the other hand, have become increasingly popular in recent years. Working with them is initially easier, but becomes more difficult once the data size starts to grow or the structure gets complicated.
One of the challenges you might have to deal with down the road is performance. Sadly, there are only a few articles warning about the limitations of NoSQL DBs. And it leaves quite a lot of room for poor choices when it comes to database design.
This article discusses how to make the most out of your current MongoDB in the context of ROI Hunter’s setup. It might apply to other document-based NoSQL DBs too. After the optimizations featured here, the size of our database decreased from about 630 GB to 107 GB per replica.
The queries speed increased fourfold. Before the optimization, there were hundreds of queries which took longer than 2 minutes to complete per day. After optimizing, not a single query took longer than 30 seconds. We also saved about 60 % of the initial cost per month on cloud infrastructure.
Introduction to ROI Hunter Marketing Statistics Service
Before we dive into the details, you need to understand what our service does and what it looks like. It stands to reason that optimizations results depend on the nature of the data stored and how it is used. This section is then essential.
Our ROI Hunter application heavily utilizes marketing statistics from various sources?—?Facebook Insights, Google Analytics, clients’ CRMs and other sources, mobile measurement statistics and possibly more in the future. These marketing statistics are not only useful for visual representation to the end users but also for billing, KPI calculation and most importantly, automation.
It is therefore important to be able to present this data in a timely manner with little room for error. Based on these requirements, we designed a service to write and read the data. We called it Daisy, and it is based on MongoDB.
The statistical data consists of documents with simple key-value pairs organised in trees, where the key represents the name of the metric. The value represents its value, which can have various data types.
Each document represents a single day for a single ad-entity (ad, ad set or campaign). The data is not very useful if it is presented for each day and each ad-entity separately. Therefore getting data for more than a single day or a single ad-entity requires an aggregation.
In fact, that is what 99 % of all our queries do. The application must also support sorting, filtering, projection, grouping and even custom formulas evaluated on the DB level. Because of all that, the service must build the queries (aggregation pipelines) at runtime dynamically. They can be extremely large: tens or even hundreds of kB long.
Each document has about 25 main metrics. Due to the amount of FB actions (a kind of Facebook metric) which has about 15 attribution windows, there can be hundreds of metrics. I won’t describe the attribution windows here, but bear in mind it can make the number of metrics up to 9 times higher.
We also store unique data?—?the problem being that it cannot be aggregated. For this reason, we have to store them separately for arbitrary time ranges. Right now, there are six of those. This basically creates another dimension in the DB schema. With a bit of simplification, it means there can be thousands of metrics overall in each document.
At the time we decided to optimise our database, we had about 200 million documents in the DB. Each document took about 16 kB in memory (uncompressed), which means the whole uncompressed dataset would barely fit in about 3.0 TB of memory.
We had two shards, each consisting of 3 data bearing replicas. Each replica had available 12 CPUs and 72 GB of memory. This setup was necessary because of diminishing performance and growing space requirements.
In the following sections, I’ll describe how we managed to decrease the number of documents to 160 million. Each document was taking up around 2.5 kB of memory, reducing the overall uncompressed footprint to about 381 GB.
In this part of the article, I’ll describe all suggestions that can make your MongoDB setup perform better.
Some recommendations in this guide help with:
Sadly, I do not have hard numbers on how much each suggestion helps on its own, only an overall result. However, you can estimate it with your own data and then decide whether it is worth it for you.
We have figures on the overall result only. However, you can estimate it with your own data. [Credit: Grohs Fabian]
Suggestion 1: Remove Unnecessary Data
This suggestion is a bit obvious, but it still might be rather difficult to implement. Some data can be easily removed. In our case, we removed documents of clients that were no longer using our platform, which reduced the number of documents by about 40 million.
However, there is data that can be quite tricky to remove?—?which is usually redundant data present in each document. Each document, therefore, needs to be updated. This is difficult because such an update can take hours to complete. And you might need to do it for each replica separately.
In our case, we removed marketing metrics that Facebook removed from their API and implemented this recommendation as part of a bigger migration.
Suggestion 2: Infer Data Whenever Possible Instead of Storing It
This might apply to you only if you execute many aggregations. You should aggregate data that can be aggregated when you read it, instead of storing aggregated data in the database. Some of the data can take a lot of space so I encourage you to do a lot of aggregations. The amount of space it saves can be quite significant, as in our case.
As I mentioned above, we stored 15 attribution windows for each Facebook action metric, but 9 of those can be easily inferred from the remaining 6. So we only need to store 6 attribution windows and aggregate the rest when someone asks for it. This helped us a lot because, as I said earlier, the attribution windows multiply the size of data we need to store in each document.
We also started inferring some metrics from other metrics. For example, Facebook metric
frequency can be calculated as
impressions divided by
reach. Overall, we removed those 12 metrics.
Suggestion 3: Optimize Your Database Schema
This is one of the most difficult recommendations to implement as it requires a huge MongoDB migration which can be hard to execute without downtime. If you have mostly worked with relational databases in the past, working with documents instead of rows can make the way you implement your schema sloppy.
Keep in mind with document databases, each record (document) contains both the data and the schema. Another thing to bear in mind is that those documents can include nested documents, which means the whole document is a tree rather than a vector.
Storing such documents can require a lot of space so you need to think about how you want your schema designed. Consider this document:
And this document:
As you can see, both documents contain the same amount of information, but the second one is significantly smaller. I suggest keeping the small sub-documents that contain the least amount of schema as deep in the tree as possible. On the other hand, large documents should be as close to the root of the tree as possible.
This recommendation is more important if you have more dimensions, like in our case. The data above has only 3 dimensions: action name (
offline_conversion:add_to_wishlist), metric name (e.g.
count) and attribution window (e.g.
Our data has one more dimension?—?date range for unique metrics (e.g.
2d?—?2 days), which has 7 members (
7d etc). This dimension was the root of our documents, which is the exact opposite of what you want.
Suggestion 4: Use Short Names for Your Fields
This idea follows up on the previous suggestion. If you implement this recommendation well, the previous one might not be that important. I got this idea after looking at the
oplog table in our own version of MongoDB. The
oplog table uses very short names for its fields, which makes a lot of sense once you realise the schema is stored in each document, as I mentioned previously.
This will not only reduce the size of your documents, but your queries size too. In many cases, that can be a highly valuable result. You might be thinking this could be handled as an optimization by MongoDB itself, but unfortunately, that does not seem to be the case.
Suggestion 5: Avoid Sharding if Possible
While it might sound a bit controversial, after working with MongoDB for the past two years, I recommend avoiding sharding. I see two main reasons for this. Firstly, it complicates the whole set up, making maintenance a lot more complicated, increasing traffic and costs. Secondly, you need to use
mongosfor routing which is an additional point that can become a bottleneck.
In our case, this was a huge issue as several times, some of our large queries killed
mongos and caused downtime. The last disadvantage is that it is difficult to reconfigure once you use it. For example, there is no way to modify the sharding key without recreating the whole database.
The alternative to sharding is just to use multiple replica sets, each containing data keyed by some “sharding” key. The routing can just take place directly within your application. That way any performance bottlenecks are easier to debug. Obviously, this is only possible if you always access all your data scoped to the sharding key (i.e. cross-sharding key queries are not needed).
Conclusion - Optimization & Key Results
After reading this article, you might think that our MongoDB had a lot of flaws in its schema design, and you might be right. The problem is that it is tricky to avoid some flaws without major inconveniences like schema migrations, API changes or downtime.
We were able to achieve 83% data size reduction on disk and 88% in-memory size reduction. [Credit: imgix]
The issues in our database could mostly be attributed to technical debt. Unfortunately, MongoDB does not always make avoiding this easy. The migration that incorporated all the optimizations took about two weeks.
To summarize, these are the key suggestions that should help your MongoDB setup perform better:
And these are the key results we were able to achieve with those optimizations:
Document databases are still quite new compared to relational databases. In many applications, they are often opted for due to their simplicity, ease of use, performance or scalability.
However, that does not mean that you should disregard their disadvantages or skimp on your schema design. In our experience, it is very important for MongoDB as it does not provide many optimizations by itself and its performance also varies depending on the physical size of your data set.
The suggestions presented in this article are meant to be a backbone of principles that can help you achieve a lean and optimized MongoDB setup. Ultimately, it is up to you to decide what to take from this. Especially given these principles might change in the future as MongoDB and document-based databases, in general, become more mature.
Thanks for reading! Make sure to like & share this article if you like what you read.
Find our other articles here.