Choosing a Database Schema for Polymorphic Data (2024)

41 points by gm678 13 hours ago

bob1029 27 minutes ago

JSON can work surprisingly well as a blob column, up to a point. The interesting part is that this point varies dramatically depending on where and how the blob is serialized.

If you are storing JSON blobs in SQLite and using a very fast serializer (gigabytes/s), then anything under a megabyte or so won't really show up on a hot path. Updates to complex entities can actually be faster, even if you're burning more IO and device wear.

If you need to join across properties in the JSON, I wouldn't use JSON. My canary is that if I find myself using the built-in JSON query functionality, I am too far into NoSQL Narnia.
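
For illustration, here is a minimal sketch of the blob-column pattern described above. It assumes Python's standard `sqlite3` and `json` modules rather than the fast serializer mentioned, and the table and function names (`entity`, `save`, `load`) are hypothetical. The whole entity is rewritten on every update, and the database never looks inside the blob.

```python
import json
import sqlite3

# In-memory database for the sketch; a real app would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")


def save(entity_id, obj):
    """Serialize the whole entity and overwrite the stored blob."""
    conn.execute(
        "INSERT OR REPLACE INTO entity (id, body) VALUES (?, ?)",
        (entity_id, json.dumps(obj)),
    )


def load(entity_id):
    """Fetch and deserialize the blob, or return None if the id is unknown."""
    row = conn.execute(
        "SELECT body FROM entity WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Hypothetical entity: the nested structure lives entirely in application code.
order = {"kind": "order", "lines": [{"sku": "A1", "qty": 2}]}
save(1, order)

# An update is "load, mutate in memory, save the whole thing again".
order = load(1)
order["lines"].append({"sku": "B7", "qty": 1})
save(1, order)
```

Nothing here filters or joins on fields inside `body`; once that becomes necessary, the canary above applies.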

fmjrey 3 hours ago

The article reads like a story of trying to fit a square peg in a round hole, discussing pros and cons of cutting the square corners vs using a bigger hole. At some point one needs to realize we're using the wrong kind of primitives to build the distributed systems of today. In other words, we've reached the limit of the traditional approach based on OO and RDBMS that used to work with 2 and 3-tier systems. Clearly OO and RDBMS will not get us out of the tar pit. FP and NoSQL came to the rescue, but even these are not enough to reduce the accidental complexity of building distributed systems with the kind of volume, data flows, and variability of data and use cases.

I see two major sources of inspiration that can help us get out of the tar pit.

The first is the EAV approach as embodied in databases such as Datomic, XTDB, and the like. This is about recognizing that tables or documents are too coarse-grained and that entity attribute is a better primitive for modeling data and defining schemas. While such flexibility really simplifies a lot of use cases, especially the polymorphic data from the article, the EAV model assumes data is always about an entity with a specific identity. Once again the storage technology imposes a model that may not fit all use cases.
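
To make the EAV idea concrete, here is a rough sketch using a plain SQLite table of (entity, attribute, value) rows. This is not the Datomic or XTDB API, only an illustration of the data shape. The table name `fact`, the helper functions, and the sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (entity, attribute) pair instead of one row per entity.
conn.execute(
    """
    CREATE TABLE fact (
        entity    INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity, attribute)
    )
    """
)

# Hypothetical sample data: two entities of different "types" that share
# some attributes and not others, with no schema change needed.
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?)",
    [
        (1, "type", "book"),
        (1, "title", "Sample A"),
        (1, "pages", "300"),
        (2, "type", "film"),
        (2, "title", "Sample B"),
        (2, "runtime_min", "120"),
    ],
)


def entity(eid):
    """Reassemble one entity as an attribute -> value mapping."""
    rows = conn.execute(
        "SELECT attribute, value FROM fact WHERE entity = ?", (eid,)
    )
    return dict(rows.fetchall())


def entities_with(attribute):
    """List ids of every entity that carries the given attribute."""
    rows = conn.execute(
        "SELECT entity FROM fact WHERE attribute = ? ORDER BY entity",
        (attribute,),
    )
    return [r[0] for r in rows]
```

Note that every row is keyed by an entity id, which is the identity assumption mentioned above.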

The second source of inspiration, which I believe is more generic and promising, is the one embodied in Rama from Red Planet Labs, which allows for any data shape to be stored following a schema defined by composing vectors, maps, sets, and lists, and possibly more if custom serde are provided. This removes the whole impedance mismatch issue between code and data store, and embraces the fact that normalized data isn't enough by providing physical materialized views. To build these, Rama defines processing topologies using a dataflow language compiled and run by a clustered streaming engine. With partitioning being a first-class primitive, Rama handles the distribution of both compute and data together, effectively reducing accidental complexity and allowing for horizontal scaling.

The difficulty we face today with distributed systems is primarily due to too many moving parts: multiple kinds of stores with different models (relational, KV, document, graph, etc.) and too many separate compute nodes (think microservices). Getting out of this mess requires platforms that can handle the distribution and partitioning of both data and compute together, based on powerful primitives for both data and compute that can be combined to handle any kind of data and volumes.

1st1 6 hours ago

FWIW polymorphic schema and queries are trivial in Gel https://docs.geldata.com/reference/edgeql/select#polymorphic...

oulipo 5 hours ago

Is Gel mature enough to be used in prod? What's the main advantage it has over Postgres?

hot_gril 10 hours ago

Definitely don't want to store types as columns in a DB, especially because of the inevitable thing that qualifies as two different types. In this situation, I'd usually take the first one (nullable cols) without much consideration. The DB doesn't need an xor constraint, but it can if you really want. New cols can be added without much impact on existing data.

And if the info is non-scalar, it's either option 2 (nullable FK) or 5 (JSON), depending on whether or not other things join with fields inside it.
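
A sketch of the nullable-columns option with the optional xor constraint, using SQLite through Python's `sqlite3`. The `payment` table and its columns are hypothetical, not taken from the article. It relies on SQLite evaluating `IS NOT NULL` to 0 or 1, so the `CHECK` can require that exactly one variant column is set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE payment (
        id         INTEGER PRIMARY KEY,
        card_last4 TEXT,   -- set only for card payments
        iban       TEXT,   -- set only for bank transfers
        -- Exactly one of the two variant columns must be non-NULL.
        CHECK ((card_last4 IS NOT NULL) + (iban IS NOT NULL) = 1)
    )
    """
)


def try_insert(card_last4, iban):
    """Attempt an insert; return False if the CHECK constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO payment (card_last4, iban) VALUES (?, ?)",
            (card_last4, iban),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, `try_insert("4242", None)` is accepted, while rows with both columns set or both NULL are rejected. Dropping the `CHECK` gives the plain nullable-columns version.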

hobs 6 hours ago

The first one gets messier faster than most; the fourth one generally grows the least crappily over time.

acquiesce 10 hours ago

I wouldn’t do this personally because the downstream code very often has to handle differences where polymorphism breaks and you end up having to query the type. Polymorphism shouldn’t be used for data, only behavior, and only in very specific circumstances. Subclassing is a different topic.

setr 9 hours ago

You wouldn’t do what? Have polymorphic data to begin with? I don't see how you can choose to avoid the scenario where record A has one of several possible kinds of related metadata, other than just ignoring it and allowing invalid representations.

ozgrakkurt 6 hours ago

You can have different tables for different data. You don’t have to put it all in the same table.

IceDane 8 hours ago

Want to elaborate on how you're going to magically disappear the inherent polymorphism in your problem domain every time?
Sometimes you can indeed view things from a different perspective and come up with a simpler data model, but sometimes you just can't.

4b11b4 8 hours ago

I learned a lot in a short time, thanks

feitico 6 hours ago

Just use NoSQL

MongoDB is great for this

dinfinity an hour ago

It's web scale!

j45 6 hours ago

I’m unsure why one wouldn’t use a polymorphic database for polymorphic data, instead of the gymnastics of bending a relational db.

toolslive 6 hours ago

Approach #6: column store?