Citus Blog

Articles tagged: Postgres

TopN for your Postgres database

Written by Furkan Sahin | March 27, 2018

People seem to love lists of the most popular things. I think this is true of many of us, including developers. Did you get all excited like I did, and listen right away to every song on the list when Spotify released Your Top Songs 2017? (Here are mine) When the Academy Awards were announced, did you check in on the candidates and winners? Did you pay attention to the medalists and top scoring hockey teams in the Winter Olympics?

The problem of finding the top items on a list is sometimes referred to as the Top-K problem, or the Top "N" problem. Whether it’s the top grossing sales reps or the most visited pages on your website, and whether you call it the Top K or the TopN, for most of us, there is usually something we want to know the top "N" of.

Finding the top "N" is not easy

To find the most frequently occurring items you generally need to count through all the records: the clicks in your web app, the number of times you’ve listened to a song, or the number of downloads of your project. It is all about counting. Counting, sorting, and limiting the list in Postgres is straightforward, and this works great on smaller sets of data. What if there are thousands of events? Machines these days are pretty fast, so this isn’t much of a problem. Millions is even acceptable. Billions? That may take a bit longer…

However, getting the counts of different items, sorting them and taking the top "N" of them out of your database—that can start to become much more challenging at larger scale.
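As a minimal sketch of the exact approach described above (count, sort, limit), here is the same idea in Python. In SQL it corresponds to the familiar GROUP BY, ORDER BY count(*) DESC, LIMIT pattern. The function name and sample data are made up for illustration.

```python
from collections import Counter


def top_n(events, n):
    """Return the n most frequent items as (item, count) pairs, most frequent first."""
    counts = Counter(events)       # one pass over every record to count it
    return counts.most_common(n)   # order by count and keep only the first n


# Hypothetical sample data: one entry per page view.
page_views = ["/home", "/pricing", "/home", "/docs", "/home", "/docs"]
print(top_n(page_views, 2))  # [('/home', 3), ('/docs', 2)]
```

Every record has to be visited and every distinct item needs its own counter, which is why this exact approach gets expensive as the data grows.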

Even further, what if you want to materialize your top N results for smaller sets on a regular basis and run combination queries to analyze them further? That is when the real problem starts. Calculating the Top N can be a challenge. This is why my team at Citus Data (where we build the Citus extension to Postgres that scales out Postgres horizontally) is happy to announce the release of the open source TopN extension for PostgreSQL.

Inspiration for TopN came from a Citus Data customer who needed to use TopN-like functionality in concert with the Citus extension that scales out their Postgres database. When designing TopN, we decided to implement TopN as a Postgres extension. And we decided to write TopN in C. TopN outputs a JSONB object which you can flexibly use for different use cases. Aggregation functions which take JSONB input and union them together are also included.

TopN can be used to calculate the most frequently occurring values in a column, and is part of the class of probabilistic distinct algorithms called sketch algorithms. Let's look further at how the TopN extension to Postgres actually works.
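To make the idea of small, mergeable summaries more concrete, here is a rough Python sketch. It is not the TopN extension's algorithm or API: the helper names (summarize, union, top) and the sample data are assumptions chosen for illustration. Each summary keeps only the most frequent items and their counts, and summaries can be combined by adding counts for matching keys.

```python
from collections import Counter


def summarize(events, keep):
    """Build a small summary: the `keep` most frequent items with their counts."""
    return dict(Counter(events).most_common(keep))


def union(summaries):
    """Merge summaries by adding up the counts of matching items."""
    merged = Counter()
    for summary in summaries:
        merged.update(summary)  # Counter.update adds counts rather than replacing them
    return dict(merged)


def top(summary, n):
    """Read the n most frequent items out of a (possibly merged) summary."""
    return Counter(summary).most_common(n)


# Hypothetical per-day event streams, each reduced to its two most frequent items.
monday = summarize(["a", "a", "b", "c", "a", "b"], keep=2)
tuesday = summarize(["b", "b", "c", "c", "c", "a"], keep=2)

weekly = union([monday, tuesday])
print(weekly)          # {'a': 3, 'b': 4, 'c': 3}
print(top(weekly, 1))  # [('b', 4)]
```

In the raw data each of "a", "b", and "c" occurs four times across the two days, but the merged summary reports 3, 4, and 3. Dropping infrequent items keeps each summary small and cheap to combine, at the cost of approximate counts.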

Keep reading

Raw SQL access for users with row-level-security

Written by Craig Kerstiens | March 19, 2018

We talk with a lot of SaaS companies that are encountering issues with their database. The most common issue we discuss relates to performance: either a need to keep scaling, or at times just how to handle the really intensive data needs of only a few customers.

And then as you continue to scale and capture more data you want to provide more value back to your customers.

At times you might even consider giving raw SQL access to your largest and most important customers. Typically, controlling what data you give them via dashboards and canned reports is ideal, since this way you can control performance impact and other risks. But if you have extra large or important customers that require raw access to the data... then PostgreSQL, and thus Citus, has your answer.

Pro-tip: Don't grant access to all of your customers.

Keep reading

Data has a certain gravity and inertia. Once it's stored it's not likely to be actively moved or frequently modified. At least not for your one source of truth. Protecting that data and ensuring it's both safely stored but also correct is worth the time investment because of the value it has.

Going further, your database schema and models are going to change far less than your application code. Because they change less frequently, the case can easily be made that spending some time to ensure correctness at the database level is a great return on time.

This post is the result of a talk I recently gave at PgDay Paris. The conference itself was a great local event in Paris, and while there we had a chance to meet with a few of our customers based in Paris as well. It’s always great to get out in person and chat with people about Postgres and their experience in scaling their database, and many remarked that the talk could be useful to others who weren’t there. So I thought it would be worthwhile to write up, and here you go:

Keep reading

Fun with SQL: generate_series in Postgres

Written by Craig Kerstiens | March 14, 2018

There are times within Postgres where you may want to generate sample data or some consistent series of records to join against for reporting. Enter the simple but handy set-returning function of Postgres: generate_series. As the name implies, generate_series allows you to generate a set of data starting at some point, ending at another point, and optionally set the incrementing value. generate_series works on two datatypes:

  • integers
  • timestamps
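
To illustrate the start, stop, and step behavior described above, here is a small Python sketch that mimics it for both integers and timestamps. It is an emulation for illustration only, not Postgres code. The helper name series is made up, and it handles positive steps only. As with generate_series, the stop value is included when the series lands on it.

```python
from datetime import datetime, timedelta


def series(start, stop, step):
    """Yield start, start + step, ... up to and including stop (positive steps only)."""
    if not (start + step > start):
        raise ValueError("step must be positive")
    current = start
    while current <= stop:
        yield current
        current = current + step


# Integers: comparable to generate_series(1, 10, 3) in Postgres.
print(list(series(1, 10, 3)))  # [1, 4, 7, 10]

# Timestamps: one value per day, both endpoints included.
days = list(series(datetime(2018, 3, 1), datetime(2018, 3, 3), timedelta(days=1)))
print([d.date().isoformat() for d in days])  # ['2018-03-01', '2018-03-02', '2018-03-03']
```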

Keep reading

If you’ve done some performance tuning with Postgres, you might have used EXPLAIN. EXPLAIN shows you the execution plan that the PostgreSQL planner generates for the supplied statement. It shows how the table(s) referenced by the statement will be scanned (using a sequential scan, index scan, etc.), and what join algorithms will be used if multiple tables are used. But how does Postgres come up with these plans?

Keep reading

When Postgres blocks: 7 tips for dealing with locks

Written by Marco Slot | February 22, 2018

Last week I wrote about locking behaviour in Postgres, which commands block each other, and how you can diagnose blocked commands. Of course, after the diagnosis you may also want a cure. With Postgres it is possible to shoot yourself in the foot, but Postgres also offers you a way to stay on target. These are some of the important do’s and don’ts that we’ve seen as helpful when working with users to migrate from their single node Postgres database to Citus or when building new real-time analytics apps on Citus.

Keep reading

Three Approaches to PostgreSQL Replication and Backup

Written by Ozgun Erdogan | February 21, 2018

The Citus distributed database scales out PostgreSQL through sharding, replication, and query parallelization. For replication, our database as a service (by default) leverages the streaming replication logic built into Postgres.

When we talk to Citus users, we often hear questions about setting up Postgres high availability (HA) clusters and managing backups. How do you handle replication and machine failures? What challenges do you run into when setting up Postgres HA?

The PostgreSQL database follows a straightforward replication model. In this model, all writes go to a primary node. The primary node then locally applies those changes and propagates them to secondary nodes.

In the context of Postgres, the built-in replication (known as "streaming replication") comes with several challenges:

  • Postgres replication doesn’t come with built-in monitoring and failover. When the primary node fails, you need to promote a secondary to be the new primary. This promotion needs to happen in a way where clients write to only one primary node, and they don’t observe data inconsistencies.
  • Many Postgres clients (written in different programming languages) talk to a single endpoint. When the primary node fails, these clients will keep retrying the same IP or DNS name. This makes failover visible to the application.
  • Postgres replicates its entire state. When you need to construct a new secondary node, the secondary needs to replay the entire history of state change from the primary node. This process is resource intensive—and makes it expensive to kill nodes in the head and bring up new ones.

The first two challenges are well understood. Since the last challenge isn’t as widely recognized, we’ll examine it in this blog post.
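
The replication model and the third challenge can be pictured with a toy Python model. This is a deliberately simplified sketch, not how Postgres implements streaming replication, and the class and method names are hypothetical. All writes go to the primary, which applies them locally and forwards them to its secondaries. A secondary that joins later has to replay the whole history first.

```python
class Node:
    """A node that holds key/value state built by applying log entries in order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # how many log entries this node has applied

    def apply(self, entry):
        key, value = entry
        self.data[key] = value
        self.applied += 1


class Primary(Node):
    """The single node that accepts writes and propagates them to secondaries."""

    def __init__(self):
        super().__init__()
        self.log = []          # full history of state changes
        self.secondaries = []

    def write(self, key, value):
        entry = (key, value)
        self.log.append(entry)
        self.apply(entry)                  # apply locally first
        for secondary in self.secondaries:
            secondary.apply(entry)         # then propagate the change

    def add_secondary(self):
        secondary = Node()
        for entry in self.log:             # a new secondary replays the entire history
            secondary.apply(entry)
        self.secondaries.append(secondary)
        return secondary


primary = Primary()
primary.write("x", 1)
primary.write("x", 2)
primary.write("y", 3)

late = primary.add_secondary()
print(late.applied)               # 3
print(late.data == primary.data)  # True
```

In this toy model the cost of adding a secondary grows with the length of the log, which is the intuition behind why replacing nodes is resource intensive.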

Keep reading

PostgreSQL rocks, except when it blocks: Understanding locks

Written by Marco Slot | February 15, 2018

On the Citus open source team, we engineers take an active role in helping our users scale out their Postgres database, be it for migrating an existing application or building a new application from scratch. This means we help you with distributing your relational data model—and also with getting the most out of Postgres.

One problem I often see users struggle with when it comes to Postgres is locks. While Postgres is amazing at running multiple operations at the same time, there are a few cases in which Postgres needs to block an operation using a lock. You therefore have to be careful about which locks your transactions take, but with the high-level abstractions that PostgreSQL provides, it can be difficult to know exactly what will happen. This post aims to demystify the locking behaviors in Postgres, and to give advice on how to avoid common problems.

Keep reading

Multi-tenant web apps with ASP.NET Core and Postgres

Written by Nate Barbettini | January 22, 2018

When it comes to building large-scale, multi-tenant applications, Microsoft's ASP.NET platform is a strong choice. Like other popular web frameworks such as Express and Django, ASP.NET is used to build web applications and APIs. It's been around for a while, but don't let that fool you: ASP.NET packs some serious muscle. After all, it powers one of the biggest Q&A networks on the web: Stack Exchange!

In the past, ASP.NET apps could only run on Windows servers. That's changed with the latest version, ASP.NET Core, which is fully open source and cross-platform. ASP.NET Core runs anywhere you need it to (Windows, Mac, Linux, Docker) and features a modern middleware pipeline, a rich package ecosystem, and blazing-fast performance.

My experience working on multi-tenant enterprise apps has taught me that it's never too early to design for scale. How you architect your code matters, as does how you architect your data. In the past, the apps I worked on were designed around a database-per-tenant model—unfortunately, the database-per-tenant model didn’t scale and caused problems once our app reached thousands of customers (aka tenants). In this post, I’ll show you a different approach to scale the underlying database with ASP.NET: sharding. With sharding you can leave behind the drawbacks of the database-per-tenant model and can scale infinitely.

In this blog post, I'll show you how to build your multi-tenant app with scale in mind. You'll learn how to use ASP.NET Core's middleware pipeline plus the sharding features of Postgres and Citus to build a scalable multi-tenant application on ASP.NET Core. Along the way we’ll start to build the MVP of our very own StackExchange. Let's get started!
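
As a rough illustration of what sharding by tenant means, here is a minimal Python sketch. It is not the hash function or placement logic that Citus uses. The function name, the shard count, and the sample rows are assumptions for illustration. The point is only that a tenant ID deterministically maps to one shard, so all rows for a tenant live together.

```python
import hashlib
from collections import defaultdict


def shard_for(tenant_id, shard_count):
    """Map a tenant ID to a shard number in the range [0, shard_count)."""
    digest = hashlib.sha256(str(tenant_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical rows: (tenant_id, question title).
rows = [
    (1, "How do I index JSONB?"),
    (2, "What is a CTE?"),
    (1, "When should I VACUUM?"),
    (3, "How do locks work?"),
]

SHARD_COUNT = 4  # assumed number of shards
shards = defaultdict(list)
for tenant_id, title in rows:
    shards[shard_for(tenant_id, SHARD_COUNT)].append((tenant_id, title))

# Every row for tenant 1 ends up on the same shard.
tenant_one_shards = {shard_for(t, SHARD_COUNT) for t, _ in rows if t == 1}
print(len(tenant_one_shards))  # 1
```

Because the mapping depends only on the tenant ID, a query scoped to a single tenant can be routed to a single shard.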

Keep reading

Database sharding explained in plain English

Written by Craig Kerstiens | January 10, 2018

Sharding is one of those database topics that most developers have a distant understanding of, but the details aren't always perfectly clear unless you've implemented sharding yourself. In building the Citus database (our extension to Postgres that shards the underlying database), we've followed a lot of the same principles you'd follow if you were manually sharding Postgres yourself. The main difference, of course, is that with Citus we’ve done the heavy lifting to shard Postgres and make it easy to adopt, whereas if you were to shard at the application layer there’s a good bit of work needed to re-architect your application.

I've found myself explaining how sharding works to many people over the past year and realized it would be useful (and maybe even interesting) to break it down in plain English.

Keep reading
