Enhanced String handling in Julia

By Scott P. Jones

Elevator Pitch

String performance is important in many areas, such as parsing text formats such as JSON, CSV, or bioinformatics data, for NLP (natural language processing), and for interfacing with other languages/libraries that use other encodings than UTF-8. This talk will discuss the JuliaString “ecosystem”.

Description

JuliaString is a new organization, formed with the goal of creating an ecosystem of standards compliant, fast, and easy to use string handling packages. Many people using Julia are scientists, and don’t really wish to deal with the complexities with things like indexing into variable-length string encodings, why their data loaded from a CSV file or from a database sometimes seem to be corrupted. There are also people interesting in writing Julia code for NLP (natural language processing), where the speed can be critical. Many fields such as bioinformatics have their own text file formats, and the new Strs package provides easier to use functions for writing fast string handling code. The talk will cover the added functionality of the Strs package, the importance of having validated strings for avoiding a number of known security exploits, the relative performance of the new strings, and its use for interfacing with other languages, databases, and libraries that frequently use other Unicode encodings, such as UTF-16. It will also cover the string literal format, with support for LaTeX, Unicode, HTML entities and Emojis, as well as easy to use formatted output that builds upon string interpolation, using the Julia type system, or alternatively using a C-like or Python-like format code syntax (but without having to count positions, or use a macro. Other topics such as thread-safe regex support, handling non-Unicode string encodings, optimized dictionaries for string keys, will be discussed, time permitting.

Notes

I’ve been involved with natural language processing, architecting and writing full string libraries to handle national language support first with 8-bit and multi-byte character sets, and then architected the entire Unicode support for a heavily used ANSI standard language / database system, which is heavily string oriented, as well as contributing to string support in Julia over the past three years, using Julia full-time while consulting for a start-up doing natural language processing to handle unstructured and semistructured data.