Written by Ozgun Erdogan | August 9, 2017
When your database is small (10s of GB), it's easy to throw more hardware at the problem and scale up. As your tables grow, however, you need to think about other ways to scale your database.
In one sense, sharding is the best way to scale. Sharding enables you to linearly scale your database's CPU, memory, and disk resources by separating your database into smaller parts. In other ways, sharding is a controversial topic. The internet is full of advice on sharding, ranging from "essential to scaling your database infrastructure" to "why you never want to shard". So the question is, whose advice should you take?
Having a database staging environment that is as close to production as possible is key to being able to test your app. This applies both to your code and to your database. Far too often, the staging database is the forgotten child in your stack, not getting the same love and attention as your production instance. For some teams, the staging database is years old; for others, worse yet, it's a 10 GB sample of a 2 TB production database.
What if you could easily have a full staging environment to experiment with, that is an exact copy of your production database? Even if that production database is 50 TB?
As of today on Citus Cloud—our fully-managed database as a service that is built to scale-out (and based on Postgres!)—you can get a full fork of your production database with the click of a button.
Citus is Postgres that scales out horizontally. We do this by distributing queries across multiple Postgres servers—and as is often the case with scale-out architectures, this approach provides some great performance gains. And because Citus is an extension to Postgres, you get all the awesome features of Postgres, such as support for JSONB, full-text search, PostGIS, and more.
The distributed nature of the Citus extension gives you new flexibility when it comes to modeling your data. This is good. But you’ll need to think about how to model your data and what type of database tables to use. The way you query your data ultimately determines how you can best model each table. In this post, we'll dive into the three different types of database tables in Citus and how you should think about each.
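To make that concrete, here's a minimal sketch of how the three table types are declared. The events, event_types, and internal_settings tables and their columns are hypothetical; create_distributed_table and create_reference_table are the Citus functions that turn a regular Postgres table into a distributed or reference table:

```sql
-- Distributed table: sharded across worker nodes by the distribution column
CREATE TABLE events (
    customer_id bigint,
    event_id    bigint,
    payload     jsonb,
    created_at  timestamptz DEFAULT now()
);
SELECT create_distributed_table('events', 'customer_id');

-- Reference table: small, replicated in full to every node, handy for joins
CREATE TABLE event_types (
    type_id int PRIMARY KEY,
    name    text
);
SELECT create_reference_table('event_types');

-- Standard Postgres table: anything you don't distribute stays local
-- to the coordinator and behaves like ordinary Postgres
CREATE TABLE internal_settings (
    key   text PRIMARY KEY,
    value text
);
```

Which of the three you reach for depends on how the table is queried: large, frequently filtered tables become distributed tables, small lookup tables become reference tables, and everything else can stay a plain Postgres table.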
As a developer, your CLI is your home. You spend a lifetime of person-years in your shell, and even small optimizations can pay major dividends for your efficiency. If you work with Postgres, and most likely with the psql terminal, you should consider investing some love and care into psql. A little-known fact is that psql has a number of configuration options, and these options can all live in an rc file called .psqlrc in your home directory. Here is my .psqlrc file, which I've customized to my liking. Let's walk through some of the commands within my .psqlrc file:
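To give a flavor of what goes in there, here's an illustrative subset of the kinds of settings such a file can hold (not a verbatim copy of my file); each of the \set and \pset commands below is a standard psql option:

```
-- ~/.psqlrc (illustrative subset)
-- Suppress startup chatter while this file runs
\set QUIET 1
-- Show how long each query takes
\timing on
-- Switch to expanded output automatically when rows are too wide
\x auto
-- Make NULLs visible instead of blank
\pset null '[null]'
-- Tab-complete SQL keywords in UPPERCASE
\set COMP_KEYWORD_CASE upper
-- Keep a longer query history
\set HISTSIZE 5000
-- Show user@database in the prompt
\set PROMPT1 '%n@%/%R%# '
\unset QUIET
```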
Just as Heroku has made it simple for you to deploy applications, at Citus Data we aim to make it simple for you to scale out your Postgres database.
Once upon a time at Heroku, it all started with git push heroku master. Later, the team at Heroku made it easy to add any service you could want within your app via heroku addons:create foo. The simplicity of dragging a slider to scale up your dynos is the type of awesome customer experience we strive to create at Citus. With Citus Cloud (our fully-managed database as a service), you can simply drag and save—then voila, you've scaled the resources to your database.
Written by Burak Yucesoy | June 30, 2017
HyperLogLog is an awesome approximation algorithm that addresses the distinct count problem. I am a big fan of HyperLogLog (HLL), so much so that I already wrote about the internals and how HLL solves the distributed distinct count problem. But there’s more to talk about, including HLL with rollup tables.
Rollup Tables and Postgres
Rollup tables are commonly used in Postgres when you don’t need to perform detailed analysis, but you still need to answer basic aggregation queries on older data.
With rollup tables, you can pre-aggregate your older data for the queries you still need to answer. Then you no longer need to store all of the older data, rather, you can delete the older data or roll it off to slower storage—saving space and computing power.
Let’s walk through a rollup table example in Postgres without using HLL.
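As a rough sketch of the idea (the page_views and daily_page_views tables and their columns are made up for illustration), you keep raw events in one table and periodically aggregate them into a per-day rollup:

```sql
-- Raw event data: one row per page view (hypothetical schema)
CREATE TABLE page_views (
    page_id    bigint,
    visitor_id bigint,
    viewed_at  timestamptz
);

-- Pre-aggregated rollup: one row per page per day
CREATE TABLE daily_page_views (
    day               date,
    page_id           bigint,
    view_count        bigint,
    distinct_visitors bigint,
    PRIMARY KEY (day, page_id)
);

-- Roll up yesterday's raw rows; afterwards the raw rows can be deleted
-- or moved off to cheaper storage
INSERT INTO daily_page_views (day, page_id, view_count, distinct_visitors)
SELECT
    viewed_at::date AS day,
    page_id,
    count(*) AS view_count,
    count(DISTINCT visitor_id) AS distinct_visitors
FROM page_views
WHERE viewed_at >= date_trunc('day', now() - interval '1 day')
  AND viewed_at <  date_trunc('day', now())
GROUP BY viewed_at::date, page_id;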
"Your father's lightsaber. This is the weapon of a Jedi Knight. Not as clumsy or random as a blaster. An elegant weapon, for a more civilized age."
—Obi-Wan Kenobi, Star Wars Episode IV: A New Hope
Announcing the release of Citus 6.2
Today I’m happy to announce that we’ve rolled out a new version of our database, Citus 6.2. Because as most of you know, good software never stops evolving. Nor should it. If you want the scoop on the new capabilities in Citus 6.2, just scroll ahead. But before diving in, I need to explain the lightsaber pic. Why? Because usually a picture speaks a thousand words, but sometimes it needs an annotation. :-)
When my colleagues first started on their journey to build Citus, they had a vision of combining the best aspects of relational databases with the elastic scale of NoSQL—to give developers a database that delivers SQL capabilities, at scale.
But vision alone does not make a successful company. The Citus co-founders needed a mix of key ingredients: the right team, good timing, good execution, a willingness to experiment and learn, plus (of course) a good idea.
When George Lucas described his days before the first Star Wars film, he said he was “searching for just the right ingredients, characters and storyline.” In Lucas’s search for the right mix, he too had to iterate: he wrote four different screenplays before landing on the final version of the original film!
Because our CTO is such a big fan of Star Wars, Ozgun sometimes talks about his vision for Citus in the language of the Jedi: Ozgun has said his aim for Citus was “to create a database as elegant and as powerful as a lightsaber.” Now, I’m more of a Stranger Things fan myself (after all, mornings are for coffee and contemplation) but I get Ozgun’s desire to create a database that gives you the benefits of SQL—at scale.
Distributed databases often require you to give up SQL and ACID transactions as a trade-off for scale. Citus is a different kind of distributed database. As an extension to PostgreSQL, Citus can leverage PostgreSQL’s internal logic to distribute more sophisticated data models. If you’re building a multi-tenant application, Citus can transparently scale out the underlying database in a way that allows you to keep using advanced SQL queries and transaction blocks.
In multi-tenant applications, most data and queries are specific to a particular tenant. If all tables have a tenant ID column and are distributed by this column, and all queries filter by tenant ID, then Citus supports the full SQL functionality of PostgreSQL—including complex joins and transaction blocks—by transparently delegating each query to the node that stores the tenant’s data. This means that with Citus, you don’t lose any of the functionality or transactional guarantees that you are used to in PostgreSQL, even though your database has been transparently scaled out across many servers. In addition, you can manage your distributed database through parallel DDL, tenant isolation, high performance data loading, and cross-tenant queries.
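As a minimal sketch, assuming a hypothetical B2B schema with companies, campaigns, and ads tables that already exist and all carry a company_id tenant column:

```sql
-- All tables carry the tenant ID (company_id) and are distributed by it,
-- so rows for the same tenant live on the same node
SELECT create_distributed_table('companies', 'id');
SELECT create_distributed_table('campaigns', 'company_id');
SELECT create_distributed_table('ads', 'company_id');

-- A query that filters by the tenant ID is routed to a single node,
-- where joins and transaction blocks behave exactly as in plain Postgres
BEGIN;
SELECT campaigns.name, count(ads.id) AS ad_count
FROM campaigns
JOIN ads
  ON ads.campaign_id = campaigns.id
 AND ads.company_id  = campaigns.company_id
WHERE campaigns.company_id = 42
GROUP BY campaigns.name;
COMMIT;
```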
Note: This is a guest blog post by Giuseppe “Pino” de Candia, the creator of Dynamo. We asked Pino to chime in with his thoughts on distributed databases and the trends he sees in this space. You can read more about Pino here.
When Ozgun, one of the founders of Citus Data, emailed me resources on scaling multi-tenant databases for B2B apps and asked me what I thought, all kinds of distributed systems tradeoffs started crossing my mind—along with memories of the forces that shaped Dynamo.
It’s been a decade since my team at Amazon worked on Dynamo, a highly available and scalable key-value store. By the time we started working on the project, Amazon was already going through two transitions.
There are a number of applications out there that have a high number of connections to Postgres. What's high? That depends on your application, but generally when you get into the few hundred connection range in Postgres you're on the higher end. Anything in the thousands is definitely high territory, and even several hundred connections can put a strain on your database. Generally, a safe level is somewhere around 300-500 connections. This may seem low if you're already running with thousands of connections, but it's likely perfectly fine with pgBouncer taking care of the heavy lifting for you. Let's drill into why a bit further.
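If you're not sure where your application falls in that range, here's a quick sketch of how to check, using Postgres's built-in pg_stat_activity view and the max_connections setting:

```sql
-- How many connections are currently open, grouped by what they're doing
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;

-- The connection ceiling the server is configured with
SHOW max_connections;
```

If most of those connections sit idle, that's exactly the situation a pooler like pgBouncer handles well: it keeps a small pool of real Postgres connections and shares them across the many connections your application opens.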