Tabular Data – Summer of Code

Parquet.jl enhancements and JuliaDB

Apache Parquet is a binary data format for tabular data. It has features for compression and memory-mapping of datasets on disk. A decent implementation of Parquet in Julia is likely to be highly performant, and it would be useful as a standard binary format for distributing tabular data. JuliaDB (through its MemPool submodule) currently requires a binary format for efficient storage and data transfer, but right now it resorts to a custom implementation. That implementation is fast, but users are asked not to rely on it because the format breaks from release to release. A Parquet reader and writer will solve this problem by standardizing the format. Prior work includes Parquet.jl, which only has a Parquet reader.

Once a basic Parquet reader and writer are in place, you will need to shift your focus to the performance-oriented array types in JuliaDB: PooledArrays, StringArrays (from WeakRefStrings.jl), StructArrays, and finally tables. You will also need to make sure that bits-types such as Dates, Rational numbers etc. are efficiently stored and memory-mapped on load. Then you will make Parquet the default format for loading, saving and (possibly) communicating data between processes in JuliaDB. By doing this project you will learn about the performance engineering of a distributed, out-of-core analytical database.
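To give a feel for the memory-mapping requirement for bits-types, here is a minimal sketch that uses only Julia's `Dates` and `Mmap` standard libraries. It writes a column of `Date` values as raw bytes and maps the file back as a `Vector{Date}` without copying. This is not the Parquet format and not JuliaDB's current storage code; `save_column` and `load_column` are hypothetical helper names chosen for illustration.

```julia
using Dates, Mmap

# Hypothetical helper: dump a column of bits-type values as raw bytes.
# This is a plain binary dump, NOT Parquet; it only illustrates the idea.
function save_column(path::AbstractString, col::Vector{T}) where {T}
    isbitstype(T) || throw(ArgumentError("element type $T is not a bits type"))
    open(path, "w") do io
        write(io, col)  # writes the in-memory representation directly
    end
    return length(col)
end

# Hypothetical helper: memory-map the file back as a Vector{T} of length n.
# The data is paged in lazily by the operating system rather than read eagerly.
load_column(path::AbstractString, ::Type{T}, n::Integer) where {T} =
    Mmap.mmap(path, Vector{T}, (n,))

dates = [Date(2018, 1, 1) + Day(i) for i in 0:4]
path = tempname()
n = save_column(path, dates)
mapped = load_column(path, Date, n)
```

A real implementation would additionally need to record the element type and length in file metadata, which is something a standardized format such as Parquet provides.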

Mentors: Shashi Gowda, Tanmay Mohapatra

GPU support in JuliaDB

JuliaDB is a distributed analytical database. It uses Julia’s multi-processing for parallelism at the moment. GPU implementations of some operations may allow relational algebra with low latency. In this project, you will be required to add basic GPU support in JuliaDB.

- Copy a table to GPU: this may be as simple as converting every column into a CuArray or GPUArray
- map, reduce and filter operations: apply simple functions on a large table that is on the GPU
  - Ensure that the columnar storage format is made use of in the lower-level code generated.
- The groupby and join operations may involve first implementing an efficient sortperm that utilizes the GPU, or an efficient hash table on the GPU
  - groupby kernel on GPU
  - join kernel on GPU (stretch goal)
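The sort-based approach to groupby mentioned above can be sketched on the CPU with Base Julia alone. The example below is only an illustration of the structure (sort the key column with `sortperm`, then reduce runs of equal keys); it does not use the GPU, and `groupreduce_sorted` is a hypothetical name rather than a JuliaDB function. A GPU version would need a GPU-capable `sortperm` and a parallel segmented reduction in place of the sequential loop.

```julia
# Hypothetical helper: group `vals` by `keys` and combine each group with `op`.
# Assumes `op` returns values of the same type as the elements of `vals`.
function groupreduce_sorted(op, keys::AbstractVector, vals::AbstractVector)
    length(keys) == length(vals) ||
        throw(DimensionMismatch("key and value columns differ in length"))
    perm = sortperm(keys)          # permutation that sorts the key column
    outkeys = eltype(keys)[]
    outvals = eltype(vals)[]
    for i in perm
        if !isempty(outkeys) && outkeys[end] == keys[i]
            # Same key as the previous row: fold into the running result.
            outvals[end] = op(outvals[end], vals[i])
        else
            # New key: start a new group.
            push!(outkeys, keys[i])
            push!(outvals, vals[i])
        end
    end
    return outkeys, outvals
end

# A table stored column-wise as a NamedTuple of vectors.
table = (id = [3, 1, 3, 2, 1], x = [10, 20, 30, 40, 50])
ks, sums = groupreduce_sorted(+, table.id, table.x)
# ks == [1, 2, 3] and sums == [70, 40, 40]
```

Working on whole columns rather than on rows is what makes this shape a candidate for GPU execution, since each column is a contiguous array.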

Mentors: Shashi Gowda, Mike Innes

A columnar query processing and optimization backend for Query.jl

Query.jl is designed to work with multiple backends. This project would add a backend for columnar sources that implements many of the optimizations that the database literature on column-oriented query processing has identified.
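As an illustration of the kind of optimization meant here, the sketch below shows late materialization, one technique commonly discussed for column stores: a filter is evaluated on a single column to obtain row indices, and only the columns the query actually needs are gathered afterwards. It uses Base Julia only and is not Query.jl's API; `select_rows` and `project` are hypothetical names.

```julia
# A columnar source modelled as a NamedTuple of equal-length vectors.
table = (id    = [1, 2, 3, 4],
         price = [5.0, 12.5, 7.0, 20.0],
         name  = ["a", "b", "c", "d"])

# Hypothetical helper: evaluate a predicate on one column only,
# returning the indices of the matching rows.
select_rows(pred, col::AbstractVector) = findall(pred, col)

# Hypothetical helper: gather only the requested columns at the selected rows.
function project(tbl::NamedTuple, cols::Tuple{Vararg{Symbol}}, rows)
    return NamedTuple{cols}(map(c -> getfield(tbl, c)[rows], cols))
end

rows = select_rows(>(10), table.price)     # touches only the price column
result = project(table, (:id, :name), rows)
# result == (id = [2, 4], name = ["b", "d"])
```

A row-oriented backend would have to construct every row before testing the predicate; here the `name` and `id` columns are read only at the rows that passed the filter.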

Recommended Skills: Very strong database design knowledge, familiarity with the Julia data stack and excellent Julia knowledge.

Expected Results: A new backend for Query.jl that runs queries against columnar stores in an optimized way.

Mentors: David Anthoff

Tabular file IO

The Queryverse has a large number of file IO packages: CSVFiles.jl, ExcelFiles.jl, FeatherFiles.jl, StatFiles.jl, ParquetFiles.jl and FstFiles.jl. This project will a) do serious performance work across all of the existing packages and b) add write capabilities to a number of them.

Recommended Skills: Experience with file formats and with writing performant Julia code.

Expected Results: Write capabilities across the packages listed above, competitive performance for all the packages listed above.

Mentors: David Anthoff