- UPM is our internal standalone library to perform static analysis of SQL code and enhance SQL authoring.
- UPM takes SQL code as input and represents it as a data structure called a semantic tree.
- Infrastructure teams at Meta leverage UPM to build SQL linters, catch user mistakes in SQL code, and perform data lineage analysis at scale.
Executing SQL queries against our data warehouse is important to the workflows of many engineers and data scientists at Meta for analytics and monitoring use cases, either as part of recurring data pipelines or for ad-hoc data exploration.
While SQL is extremely powerful and very popular among our engineers, we’ve also faced some challenges over the years, namely:
- A need for static analysis capabilities: In a growing number of use cases at Meta, we must understand programmatically what happens in SQL queries before they are executed against our query engines, a task called static analysis. These use cases range from performance linters (suggesting query optimizations that query engines cannot perform automatically) to data lineage analysis (tracing how data flows from one table to another). This was hard for us to do for two reasons: First, while query engines internally have some capabilities to analyze a SQL query in order to execute it, this query analysis component is typically deeply embedded inside the query engine’s code. It is not easy to extend, and it is not intended for consumption by other infrastructure teams. Second, each query engine has its own analysis logic, specific to its own SQL dialect; as a result, a team that wants to build a piece of analysis for SQL queries would have to reimplement it from scratch inside each SQL query engine.
- A limiting type system: Initially, we used only the fixed set of built-in Hive data types (string, integer, boolean, etc.) to describe table columns in our data warehouse. As our warehouse grew more complex, this set of types became insufficient, as it left us unable to catch common categories of user errors, such as unit errors (imagine making a UNION between two tables, both of which contain a column called timestamp, but one is encoded in milliseconds and the other in nanoseconds), or ID comparison errors (imagine a JOIN between two tables, each with a column called user_id, where the IDs are in fact issued by different systems and therefore cannot be compared).
How UPM works
To address these challenges, we have built UPM (Unified Programming Model). UPM takes in a SQL query as input and represents it as a hierarchical data structure called a semantic tree.
For example, if you pass this query to UPM:
SELECT
COUNT(DISTINCT user_id) AS n_users
FROM login_events
UPM will return this semantic tree:
SelectQuery(
    items=[
        SelectItem(
            name="n_users",
            type=upm.Integer,
            value=CallExpression(
                function=upm.builtin.COUNT_DISTINCT,
                arguments=[ColumnRef(name="user_id", parent=Table("login_events"))],
            ),
        )
    ],
    parent=Table("login_events"),
)
Other tools can then use this semantic tree for different use cases, such as:
- Static analysis: A tool can inspect the semantic tree and then output diagnostics or warnings about the query (such as a SQL linter).
- Query rewriting: A tool can modify the semantic tree to rewrite the query.
- Query execution: UPM can act as a pluggable SQL front end, meaning that a database engine or query engine can use a UPM semantic tree directly to generate and execute a query plan. (The term front end in this context is borrowed from the world of compilers; the front end is the part of a compiler that converts higher-level code into an intermediate representation that will ultimately be used to generate an executable program.) Alternatively, UPM can render the semantic tree back into a target SQL dialect (as a string) and pass that to the query engine.
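To make the static analysis use case concrete, here is a minimal Python sketch of a linter that walks a tree shaped like the one printed earlier. The node classes, the `find_calls` and `lint` helpers, and the lint rule itself are all hypothetical stand-ins written for this illustration; they are not UPM's actual API.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical node classes that loosely mirror the printed semantic tree.
@dataclass
class Table:
    name: str


@dataclass
class ColumnRef:
    name: str
    parent: Table


@dataclass
class CallExpression:
    function: str
    arguments: List["Expr"]


Expr = Union[ColumnRef, CallExpression]


@dataclass
class SelectItem:
    name: str
    value: Expr


@dataclass
class SelectQuery:
    items: List[SelectItem]
    parent: Table


def find_calls(expr, function):
    """Yield every CallExpression under expr that invokes `function`."""
    if isinstance(expr, CallExpression):
        if expr.function == function:
            yield expr
        for arg in expr.arguments:
            yield from find_calls(arg, function)


def lint(query):
    """Return warnings for the query; the single rule here is illustrative."""
    warnings = []
    for item in query.items:
        for _ in find_calls(item.value, "COUNT_DISTINCT"):
            warnings.append(
                f"{item.name}: exact distinct count found; "
                "consider an approximate count if exactness is not required"
            )
    return warnings


query = SelectQuery(
    items=[
        SelectItem(
            name="n_users",
            value=CallExpression(
                function="COUNT_DISTINCT",
                arguments=[ColumnRef("user_id", Table("login_events"))],
            ),
        )
    ],
    parent=Table("login_events"),
)

for warning in lint(query):
    print(warning)
```

Because the linter only inspects the tree, the same rule would apply regardless of which SQL dialect the query was originally written in.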
A unified SQL language front end
UPM allows us to provide a single language front end to our SQL users so that they only need to work with a single language (a superset of the Presto SQL dialect), whether their target engine is Presto, Spark, or XStream, our in-house stream processing service.
This unification is also beneficial to our data infrastructure teams: Teams that own SQL static analysis or rewriting tools can use UPM semantic trees as a standard interop format, without worrying about parsing, analysis, or integration with different SQL query engines and SQL dialects. Similarly, much like Velox can act as a pluggable execution engine for data management systems, UPM can act as a pluggable language front end for data management systems, saving teams the effort of maintaining their own SQL front end.
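As a rough illustration of rendering one tree into several target dialects, the Python sketch below emits the same query with Presto-style identifier quoting (double quotes) and Spark SQL-style quoting (backticks). The node classes and the `render` function are hypothetical and cover only this one query shape; a real front end would handle far more dialect differences than quoting.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical, heavily simplified tree nodes (not UPM's actual classes).
@dataclass
class ColumnRef:
    name: str


@dataclass
class CountDistinct:
    argument: ColumnRef


@dataclass
class SelectItem:
    name: str
    value: CountDistinct


@dataclass
class SelectQuery:
    items: List[SelectItem]
    table: str


# Identifier quote character per target dialect.
QUOTES = {"presto": '"', "spark": "`"}


def render(query: SelectQuery, dialect: str) -> str:
    """Render the tree as a SQL string for the given dialect."""
    quote = QUOTES[dialect]

    def ident(name: str) -> str:
        return f"{quote}{name}{quote}"

    columns = ", ".join(
        f"COUNT(DISTINCT {ident(item.value.argument.name)}) AS {ident(item.name)}"
        for item in query.items
    )
    return f"SELECT {columns} FROM {ident(query.table)}"


query = SelectQuery(
    items=[SelectItem(name="n_users", value=CountDistinct(ColumnRef("user_id")))],
    table="login_events",
)

print(render(query, "presto"))
print(render(query, "spark"))
```

Analysis tools would operate on `SelectQuery` alone, so only the renderer needs to know about engine-specific syntax.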
Enhanced type-checking
UPM also allows us to provide enhanced type-checking of SQL queries.
In our warehouse, each table column is assigned a “physical” type from a fixed list, such as integer or string. Additionally, each column can have an optional user-defined type; while it does not affect how the data is encoded on disk, this type can supply semantic information (e.g., Email, TimestampMilliseconds, or UserID). UPM can take advantage of these user-defined types to improve static type-checking of SQL queries.
For example, a SQL query author might want to UNION data from two tables that contain information about different login events:
In the query on the right, the author is trying to combine timestamps in milliseconds from the table user_login_events_mobile with timestamps in nanoseconds from the table user_login_events_desktop, an understandable mistake, as the two columns have the same name. But because the tables’ schemas have been annotated with user-defined types, UPM’s typechecker catches the error before the query reaches the query engine; it then notifies the author in their code editor. Without this check, the query would have completed successfully, and the author might not have noticed the error until much later.
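The check described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `ColumnType` class, the `check_union` function, the `bigint` physical type, the `timestamp` column name, and the `TimestampNanoseconds` type name do not come from UPM itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ColumnType:
    physical: str                       # how the data is encoded, e.g. "bigint"
    user_defined: Optional[str] = None  # optional semantic annotation


Schema = Dict[str, ColumnType]  # column name -> type, in column order


def check_union(left: Schema, right: Schema) -> List[str]:
    """Compare UNION inputs position by position and report type mismatches."""
    errors = []
    for (lname, ltype), (rname, rtype) in zip(left.items(), right.items()):
        if ltype.physical != rtype.physical:
            errors.append(
                f"{lname}/{rname}: physical types differ "
                f"({ltype.physical} vs {rtype.physical})"
            )
        elif (
            ltype.user_defined
            and rtype.user_defined
            and ltype.user_defined != rtype.user_defined
        ):
            errors.append(
                f"{lname}/{rname}: user-defined types differ "
                f"({ltype.user_defined} vs {rtype.user_defined})"
            )
    return errors


# Assumed schemas for the two tables named in the text.
user_login_events_mobile = {
    "user_id": ColumnType("bigint", "UserID"),
    "timestamp": ColumnType("bigint", "TimestampMilliseconds"),
}
user_login_events_desktop = {
    "user_id": ColumnType("bigint", "UserID"),
    "timestamp": ColumnType("bigint", "TimestampNanoseconds"),
}

errors = check_union(user_login_events_mobile, user_login_events_desktop)
for error in errors:
    print(error)
```

Both `timestamp` columns share the same physical type, so a checker that looked only at physical types would accept this UNION; the mismatch surfaces only once the user-defined annotations are compared.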
Column-level data lineage
Data lineage (understanding how data flows within our warehouse and through to consumption surfaces) is a foundational piece of our data infrastructure. It enables us to answer data quality questions (e.g., “This data looks incorrect; where is it coming from?” and “Data in this table were corrupted; which downstream data assets were impacted?”). It also helps with data refactoring (“Is this table safe to delete? Is anyone still depending on it?”).
To help us answer these critical questions, our data lineage team has built a query analysis tool that takes UPM semantic trees as input. The tool examines all recurring SQL queries to build a column-level data lineage graph across our entire warehouse. For example, given this query:
INSERT INTO user_logins_daily_agg
SELECT
DATE(login_timestamp) AS day,
COUNT(DISTINCT user_id) AS n_users
FROM user_login_events
GROUP BY 1
Our UPM-powered column lineage analysis would deduce these edges:
[
  {
    from: "user_login_events.login_timestamp",
    to: "user_logins_daily_agg.day",
    transform: "DATE"
  },
  {
    from: "user_login_events.user_id",
    to: "user_logins_daily_agg.n_users",
    transform: "COUNT_DISTINCT"
  }
]
By putting this information together for every query executed against our data warehouse each day, the tool shows us a global view of the full column-level data lineage graph.
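A simplified version of this edge extraction might look like the following Python sketch. The node classes and the `lineage_edges` function are hypothetical and only handle the single-function-call shape of the example query; they are not the lineage team's actual tool.

```python
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical, simplified tree nodes for an INSERT ... SELECT query.
@dataclass
class ColumnRef:
    table: str
    name: str


@dataclass
class CallExpression:
    function: str
    arguments: List[ColumnRef]


@dataclass
class SelectItem:
    name: str
    value: CallExpression


@dataclass
class InsertQuery:
    target: str
    items: List[SelectItem]


def lineage_edges(query: InsertQuery) -> List[Dict[str, str]]:
    """Emit one edge per source column feeding each output column."""
    edges = []
    for item in query.items:
        for column in item.value.arguments:
            edges.append(
                {
                    "from": f"{column.table}.{column.name}",
                    "to": f"{query.target}.{item.name}",
                    "transform": item.value.function,
                }
            )
    return edges


query = InsertQuery(
    target="user_logins_daily_agg",
    items=[
        SelectItem(
            "day",
            CallExpression("DATE", [ColumnRef("user_login_events", "login_timestamp")]),
        ),
        SelectItem(
            "n_users",
            CallExpression("COUNT_DISTINCT", [ColumnRef("user_login_events", "user_id")]),
        ),
    ],
)

edges = lineage_edges(query)
for edge in edges:
    print(edge)
```

Collecting the edges from every recurring query into one list would give the edge set of the warehouse-wide graph, which is the aggregation step described above.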
What’s next for UPM
We look forward to more exciting work as we continue to unlock UPM’s full potential at Meta. Eventually, we hope all Meta warehouse tables will be annotated with user-defined types and other metadata, and that enhanced type-checking will be strictly enforced in every authoring surface. Most tables in our Hive warehouse already leverage user-defined types, but we are rolling out stricter type-checking rules gradually, to facilitate the migration of existing SQL pipelines.
We have already integrated UPM into the main surfaces where Meta’s developers write SQL, and our long-term goal is for UPM to become Meta’s unified SQL front end: deeply integrated into all our query engines, exposing a single SQL dialect to our developers. We also intend to iterate on the ergonomics of this unified SQL dialect (for example, by allowing trailing commas in SELECT clauses and by supporting syntax constructs like SELECT * EXCEPT