diacritical insensitive search in python

0 votes

One feature that seems to be missing in the re module (or any tools that I know for searching text) is "diacritical insensitive search". I would like to have a match for something like this:

re.match("franc", "français")

in much the same way that we can have a case-insensitive search:

re.match("(?i)fran", "Français").

Another related and more general problem (in the sense that it could easily be used to solve the first problem) would be to translate a string removing any diacritical mark:

nodiac("Français") -> "Francais"

The algorithm to write such a function is trivial, but there are a lot of marks that can be put on a letter. It would be necessary to have the list of every "a" with something on it, i.e. "à, á, ã", etc., and this for every letter. Trying to make such a list by hand would inevitably lead to some symbols being forgotten (and would be tedious).
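In fact the hand-made list can be avoided entirely: the standard library's unicodedata module can decompose each precomposed character into a base letter plus combining marks, and the marks can then be filtered out by their Unicode category. A minimal sketch of the desired nodiac function (the name nodiac is taken from the question above):

```python
import unicodedata

def nodiac(s):
    # NFKD decomposition splits precomposed characters, so "ç" becomes
    # "c" followed by U+0327 COMBINING CEDILLA. Combining marks all have
    # Unicode category "Mn" (Mark, nonspacing), so we simply drop them.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(nodiac("Français"))  # -> Francais
```

This relies on the Unicode character database rather than an enumerated list, so no accented symbol can be forgotten. Note that a few characters (e.g. "ø", "ł") use a stroke that is not a combining mark and would survive this filter.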

posted May 17, 2013 by anonymous


1 Answer

0 votes

The handling of diacriticals is a particularly nice case study. One can use it to toy with some specific features of Unicode: normalisation, decomposition, ...

... and also to show how Unicode can be badly implemented.

A first, quick example that came to my mind (Python 3.2.5 and 3.3.2):

>>> timeit.repeat("ud.normalize('NFKC', ud.normalize('NFKD', 'ᶑḗḖḕḹ'))", "import unicodedata as ud")
[2.929404406789672, 2.923327801150208, 2.923659417064755]
>>> timeit.repeat("ud.normalize('NFKC', ud.normalize('NFKD', 'ᶑḗḖḕḹ'))", "import unicodedata as ud")
[3.8437222586746884, 3.829490737203514, 3.819266963414293]
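The same decomposition machinery also answers the original question directly. A hedged sketch (strip_marks and match_nodiac are names invented here, not standard functions): normalize both the pattern and the target, then delegate to the ordinary re module:

```python
import re
import unicodedata

def strip_marks(s):
    # Decompose (NFKD) and drop combining marks (category "Mn").
    return "".join(c for c in unicodedata.normalize("NFKD", s)
                   if unicodedata.category(c) != "Mn")

def match_nodiac(pattern, string, flags=0):
    # Diacritic-insensitive match: strip marks from both sides first.
    # Caveat: this assumes the pattern itself contains no regex syntax
    # that stripping could disturb.
    return re.match(strip_marks(pattern), strip_marks(string), flags)

match_nodiac("franc", "français")      # matches
match_nodiac("(?i)fran", "Français")   # matches, case-insensitively too
```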
answer May 17, 2013 by anonymous
Similar Questions
+1 vote

I want to do the Boolean search over various sentences or documents. I do not want to use special programs like Whoosh, etc.

Can I use some other parser? If anybody knows, please let me know.

0 votes

This is my dilemma: I'm trying to get the generated JSON file using the Bing API search.

This is the code that I'm executing from inside the shell: [1]

The port doesn't matter to me. Thoughts?

+1 vote

I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.

How should I approach this? Seems like the straightforward way to do it would be to loop through each of the files and go line by line comparing all the terms to the query, but this seems like it would take too long.

Can someone give me a suggestion as to how to minimize the search time?
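One common way to cut the search time is to compile all the queries into a single alternation, so each line of each file is scanned once rather than 500 times. A minimal sketch (the queries list here is a stand-in for the real 500):

```python
import re

# Stand-ins for the actual queries; re.escape guards against
# regex metacharacters appearing in a query string.
queries = ["foo", "bar baz", "qux"]
pattern = re.compile("|".join(re.escape(q) for q in queries))

def find_matches(lines):
    # Return (line_number, matched_query) pairs for one file's lines.
    hits = []
    for lineno, line in enumerate(lines, 1):
        for m in pattern.finditer(line):
            hits.append((lineno, m.group(0)))
    return hits

find_matches(["foo and bar baz", "nothing here", "qux!"])
```

For very large query sets, a dedicated multi-pattern algorithm such as Aho-Corasick (available as third-party packages) scales better than a regex alternation.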

0 votes

I am designing an abstraction layer over a select few NoSQL and SQL databases.


  • Redis, Neo4j, MongoDB, CouchDB
  • PostgreSQL

Being inexperienced, I find it hard to know a nice way of abstracting search. For conciseness in my explanation, think of Table as being a table, object, entity, or key; and name as being a name or type.

Maybe res =

Or on multiple Tables:
res = AbstractDB().AbstractSearch(


Then: res.paginate(limit=25, offset=5)

Or if you want all: res.all()

And additionally borrow/alias from a relevant subset of PEP249, e.g.: fetchone and fetchmany

Will open-source this once it's of sufficient functionality. Any suggestions?