When 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and such

By Kim Berg Hansen

Elevator Pitch

A-Z in upper and lower case isn’t enough in a global world, you need to support all sorts of textual data. Oracle database can handle most of it, but there are pitfalls to be aware of - even using Unicode that should handle everything. Learn about NLS and collation and converting charactersets.

Description

How life as a DB developer would be much easier if the entire world was happy with just A-Z in upper and lower case. But the globalized world is a very diverse place with myriads of special characters and modifiers, different alphabets, writing directions, sorting and comparison rules, and many ways to encode these characters. So your database and application absolutely needs to know about these things and handle them correctly, or you’ll find your app displaying garbage or worse, storing garbage.

Oracle database offers many NLS (National Language Support) settings to handle such globalization on database, instance and session/client level. NLS functions are available in the SQL language to handle specific cases with code when the session settings aren’t granular enough. In Oracle 12.2 was added Collation support, where you specify right down on column level how this column should sort and compare values. You have choices whether to store everything with database characterset UTF8 or stick to a single-byte characterset - in both cases be aware of how conversion takes place if the client is not using the database characterset.

This presentation discusses many of the issues involved and ways to use NLS and Collation to avoid them. Pitfalls will be shown, where you can get mangled (or even invalid) data stored if you are not careful.

Notes

Topics covered include:

  • Character sets and encodings. ASCII, ISO, UTF and more.
  • Database character set - what can be used for identifiers.
  • National character set - alternative for using Unicode in DB with single-byte DB charset.
  • NLS settings of the database versus instance versus session (and client.)
  • Plug-compatible charsets in multi-tenant environment.
  • Circumstances leading to invalid characters in the database.
  • BYTE versus CHAR length semantics.
  • Linguistic sorting and matching - NLS_SORT, NLS_COMP.
  • Data-bound collation - defaults, column level and PL/SQL units.
  • Database Migration Assistant for Unicode (DMU) for changing DB charset.