User-defined functions (UDFs) are bits of application-specific logic which can be invoked within database queries, extending the builtin capabilities of the database. You might be wondering why not keep application-specific logic in the application layer? Are user-defined functions (UDFs) the relic of a past when databases were designed to be directly queried by end-users?
Far from it! Placing domain-specific logic in the database is often the best (sometimes only) way to implement functionality such as:
AVG which must run within the database to avoid shipping the entire set of input rows to the client to calculate output.
UDFs are an essential building block supported by the majority of databases one way or another, and in this regard Seafowl is no exception. What makes Seafowl special is its commitment to running "at the edge": as close to the user as possible, which means pretty much anywhere. But if Seafowl is to run anywhere, UDFs referenced by Seafowl queries should be just as portable. How can Seafowl offer UDF authors a platform which makes this possible?
Like any program, a UDF targets a specific platform. This platform should be hardware-independent so UDFs keep working if the database is moved to a different machine, but also efficient enough to process large datasets in acceptable time.
Mature databases often have a custom language for defining UDFs such as PostgreSQL's PL/PGSQL or PL/SQL. For simple functions just beyond the capabilities of an SQL statement, these languages offer an ideal solution. For more complex functionality or code with lots of dependencies, many developers choose a general programming language, preferably one used in the application layer.
Custom logic written in compiled languages such as C++, Rust or Java is more difficult to invoke from a native database process, requiring either running the UDF in a separate process or compiling against the C ABI of the database and executing it with full access to the database. The security implications of the second option rule out this approach for UDFs.
Inter-process communication (IPC) with a separate JVM process provides a portable solution supported by Oracle and IBM DB2, but introduces a performance penalty for each UDF invocation along with additional corner cases arising from the need to synchronize a JVM process with the database. For example, should updating the Java bytecode for a UDF impact running transactions? If not, how many versions of the UDF must be kept in memory and how should the database keep track of which is the current version in each transaction? When is it safe to remove classes upon which a previous version of the UDF depends? If a transaction defining a UDF is rolled back, which classes' bytecode needs to be ejected from the JVM? The list goes on and on, and it's mostly independent of Java, stemming instead from the marriage of a transactional database and another process which dynamically loads UDF code.
To summarize, there's hardly a silver bullet: each solution for providing UDFs has its pros and cons. Worse, none of the solutions considered so far are truly portable in the sense that they can be used in the browser, on dedicated servers and desktops without serious limitations.
Predating Seafowl and Splitgraph by decades, developers of large native applications faced a similar conundrum. As new devices and platforms appeared, customers wanted to run their applications in previously unsupported environments. Porting existing applications to new platforms was the traditional remedy, a slow and expensive process.
In the 90s, Java tried to address this problem, promising developers they could "write once, run everywhere". As Java eventually found success in the very different niches of backend services and Android devices, its early focus on "running everywhere" understandably changed to working best where it's used.
Even green field projects struggled to find a stack sufficiently portable and efficient to ensure the application's long-term growth in desktop, mobile and browser environments.
Instead, the WebAssembly specification provides a limited set of low-level instructions and linear memory model. Thanks to the relative ease of implementation, today there are many high quality implementations, allowing code targeting WebAssembly to run anywhere from embedded devices to browsers, CDNs, and yes- in databases as well.
How does WebAssembly as a platform for implementing UDFs compare to scripting languages like Lua or more traditional compiled solutions like Java or C? Since WebAssembly is designed to be used in very diverse environments, it cannot afford to make many assumptions:
stdout or environment variables unless explicitly provided via functions that the WASM module can call at runtime.
That last point presents a significant challenge when using WebAssembly for authoring UDFs. Many of the types supported by Seafowl such as
cannot be simply represented solely by scalar numeric values.
One option is to limit UDFs to consuming and returning types natively supported by WebAssembly (this is ScyllaDB's approach).
For Seafowl's UDFs, we wanted users to be able to work with any datatype including strings, decimal numbers and dates.
In WebAssembly, memory is just a sequence of bytes and pointers are just 32 bit integers, similar to an array index. Any object which can be serialized to a byte stream can be passed by the host to a WebAssembly function by first writing the serialized bytes to WASM memory, then invoking the function with a pointer to the freshly written memory address as its argument. Similarly, the WASM function can serialize it's output as a byte stream to which it returns a pointer.
Seafowl uses MessagePack for serialization - think of it like a faster binary version of JSON.
Today, Seafowl supports creating UDFs in any programming language which compiles to WASM. Check out our Rust example UDF repository! Any Seafowl type may be used for input values and return type.
One of the biggest advantages of using UDFs is aggregated records within the database. As of 2022 December, Seafowl does not yet support user-defined aggregation functions, but this support is planned. When it does arrive, you can be sure it will also use WebAssembly!