Unicode processing in Python

By Andrew Svetlov

Elevator Pitch

Maybe you know how to work with codecs like UTF-8 and store Unicode strings in files and databases. But what about more complex tasks? How do you normalize strings, write correct regular expression patterns, and do other text processing? And finally: how many punctuation characters exist?
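As a rough illustration of that last question, one way to count punctuation is to scan every code point and keep those whose general category starts with "P". This is only a sketch: the helper name `count_punctuation` is hypothetical, and the result depends on the Unicode database version bundled with your Python, so no fixed number is given here.

```python
import sys
import unicodedata


def count_punctuation() -> int:
    """Count code points whose Unicode general category is punctuation (P*)."""
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("P")
    )


if __name__ == "__main__":
    # unidata_version tells you which Unicode release the count refers to.
    print(unicodedata.unidata_version, count_punctuation())
```

Note that several ASCII characters commonly thought of as punctuation, such as `$` or `+`, are classified as symbols (S*) and are not counted.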

Description

Everybody uses Unicode nowadays, at least for emoji in Slack or Twitter 🤓

Saving data in UTF-8 and reading it back is a well-known procedure. What about more complex challenges?

  1. Unicode, code points and byte strings. What is each of them for?
  2. Converting bytes to Unicode. Error modes, codecs, etc. Source encoding autodetection as a bonus.
  3. UTF-16, little/big endian, surrogate pairs. What a casual software developer should know about them.
  4. Unicode composite characters and their normalization.
  5. Unicode categories, the API for working with them, and internationalized regular expressions.
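To give a flavour of topics 2 to 5, here is a minimal sketch using only the standard library. It is not taken from the talk itself; the sample strings and variable names are made up for illustration.

```python
import re
import unicodedata

# Topic 2: decoding bytes, and what the error modes do with bad input.
raw = "na\u00efve caf\u00e9".encode("utf-8")          # "naïve café" as UTF-8 bytes
decoded = raw.decode("utf-8")                         # correct codec: round-trips
replaced = raw.decode("ascii", errors="replace")      # each bad byte becomes U+FFFD
ignored = raw.decode("ascii", errors="ignore")        # bad bytes are silently dropped

# Topic 3: a code point outside the BMP needs a surrogate pair in UTF-16.
nerd = "\U0001F913"                                   # 🤓, one code point
utf16_le = nerd.encode("utf-16-le")                   # two 16-bit units, low byte first
utf16_be = nerd.encode("utf-16-be")                   # same units, high byte first

# Topic 4: the same visible character can be one code point or two.
composed = "\u00e9"                                   # é as a single code point
decomposed = "e\u0301"                                # e + combining acute accent
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Topic 5: general categories and Unicode-aware regular expressions.
categories = [unicodedata.category(ch) for ch in "A1 \u00ab"]
text = "na\u00efve \u041f\u0440\u0438\u0432\u0435\u0442 \u4f60\u597d"  # naïve Привет 你好
unicode_words = re.findall(r"\w+", text)              # \w is Unicode-aware for str
ascii_words = re.findall(r"\w+", "na\u00efve", flags=re.ASCII)

print(replaced, ignored, utf16_le.hex(), same_after_nfc, categories)
print(unicode_words, ascii_words)
```

The `re.ASCII` flag is shown only for contrast: with it, `\w` stops matching non-ASCII letters, which is a common source of bugs when patterns are ported from Python 2 or from other languages.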

I work at https://ocean.io/ where we parse a great many web pages in the wild and extract useful information from them. The talk is based on our experience in this area.