Monday, 6 June 2016

JavaScript regex oddities


Open a JS console and try this:


/./.exec('a')
>["a"]

^ This regex '.' will match the single character 'a'.

Now try it with a Unicode character outside the Basic Multilingual Plane, like an emoji:

/./.exec('😂')
> ["�"]

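The JS regex matches only half of the Unicode character. JS strings are sequences of UTF-16 code units, the emoji takes two of them (a surrogate pair), and without the u flag '.' matches a single code unit.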

What is interesting is that if you ask for a 2-character match, JS finds the whole character:


/../.exec('😂')
>["😂"]

'😂'.length
>2
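
A minimal sketch of the same experiment with the u (unicode) flag, assuming an ES2015 or later engine. With the flag set, '.' matches a whole code point instead of a single UTF-16 code unit. The variable names are only for illustration.

```javascript
// Assumes an engine that supports the ES2015 `u` regex flag.
const emoji = '\u{1F602}'; // 😂, stored as two UTF-16 code units

const withoutU = /./.exec(emoji)[0]; // lone high surrogate (half the emoji)
const withU = /./u.exec(emoji)[0];   // the whole emoji

console.log(withoutU.length);          // 1
console.log(withU === emoji);          // true
console.log(emoji.length);             // 2 (counts code units)
console.log(Array.from(emoji).length); // 1 (counts code points)
```

Array.from iterates a string by code point, so it is one way to count what most people would call characters here, although it still does not merge combining sequences.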

In other, unrelated regex surprises: \w does not understand accents. It only matches ASCII letters, digits and underscore:

/\w/.exec('Ä')
> null
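
One possible workaround, sketched here under the assumption that the engine supports Unicode property escapes (ES2018 or later, so newer than this post): build a word-like class from \p{L} (letters) and \p{N} (numbers). The name unicodeWord is made up for this example.

```javascript
// Assumes an engine with Unicode property escapes (\p{...} with the `u` flag).
const asciiWord = /\w/;               // same as [A-Za-z0-9_]
const unicodeWord = /[\p{L}\p{N}_]/u; // any letter or number, plus underscore

const aUmlaut = '\u00C4'; // Ä as a single precomposed code point

console.log(asciiWord.exec(aUmlaut));               // null
console.log(unicodeWord.exec(aUmlaut)[0] === aUmlaut); // true
console.log(unicodeWord.test('-'));                 // false
```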


Reading more about crazy Unicode in JavaScript. Note that some accented characters can be stored as a letter followed by a combining accent (2 code points), while the same character can also be stored as a single letter-with-accent (1 code point). Of course, when this happens the two strings have different lengths and don't compare equal, even though they look identical.
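
A small sketch of that mismatch, and of String.prototype.normalize (ES2015) as one way to compare such strings. The variable names are illustrative only.

```javascript
const composed = '\u00C4';    // Ä as one precomposed code point
const decomposed = 'A\u0308'; // A followed by COMBINING DIAERESIS

console.log(composed.length, decomposed.length); // 1 2
console.log(composed === decomposed);            // false

// Normalizing both sides to the same form (NFC here) makes them comparable.
const sameAfterNfc =
  composed.normalize('NFC') === decomposed.normalize('NFC');
console.log(sameAfterNfc);                       // true
console.log(decomposed.normalize('NFC').length); // 1
```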

People are upset about Python's handling of Unicode too.
