What’s the correct regex range for JavaScript regexes to match all the non-word characters in any script?

Problem :

In Python or PHP, a simple regex such as /\W/gu matches any non-word character in any script. In JavaScript, however, it matches only [^A-Za-z0-9_]. What are the correct ranges to match the same characters as Python and PHP?

https://regex101.com/r/yhNF8U/1/

Solution :

Generic solution

Mathias Bynens suggests following the UTS18 recommendation, so a Unicode-aware \W will look like:

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

Please note the comment for the suggested Unicode property class combination:

This is only an approximation to Word Boundaries (see \b below). The
Connector Punctuation is added in for programming language
identifiers, thus adding “_” and similar characters.
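
As a minimal sketch of how the UTS18-style class above might be used in JavaScript: the `u` flag is required for `\p{...}` property escapes to be recognised, and the helper name `stripNonWord` is purely illustrative, not part of any library.

```javascript
// Sketch: UTS #18-style replacement for \W. The `u` flag is required for \p{...}.
const nonWordUts18 =
  /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/gu;

// Hypothetical helper: drop every non-word character from a string.
function stripNonWord(text) {
  return text.replace(nonWordUts18, "");
}

console.log(stripNonWord("привет, мир!"));   // "приветмир"
console.log(stripNonWord("foo_bar-baz 42")); // "foo_barbaz42"
```

Cyrillic letters, digits, and the underscore survive, while punctuation and whitespace are removed.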

More considerations

The \w construct (and thus its \W counterpart), when matching in a Unicode-aware context, matches a similar but somewhat different set of characters across regex engines.

For example, the .NET definition of the non-word character class \W is [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}], where \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm} can be contracted to \p{L}, so the pattern is equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].

In Android (see documentation), it is [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can be written simply as \p{M}.

In PHP PCRE, \W matches [^\p{L}\p{N}_].

The Rexegg cheat sheet defines Python 3 \w as “Unicode letter, ideogram, digit, or underscore”, i.e. [\p{L}\p{Mn}\p{Nd}_].

You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:

/[^\p{L}\p{N}\p{M}\p{Pc}]/gu

where

  • [^ – is the start of the negated character class that matches a single char other than:
    • \p{L} – any Unicode letter
    • \p{N} – any Unicode number (decimal digits, letter numbers, and other numeric characters)
    • \p{M} – a diacritic (combining) mark
    • \p{Pc} – a connector punctuation symbol
  • ] – end of the character class.

Note that it is the \p{Pc} class that matches an underscore.
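
To illustrate the difference, here is a small sketch comparing the built-in \W with the decomposed class on a mixed-script string. The variable names and the sample text are illustrative only.

```javascript
// Built-in \W: ASCII-only in JavaScript, even with the `u` flag.
const asciiNonWord = /\W/gu;
// Rough Unicode-aware equivalent from the decomposition above.
const unicodeNonWord = /[^\p{L}\p{N}\p{M}\p{Pc}]/gu;

const text = "привет_мир 42!";

console.log(text.replace(asciiNonWord, ""));   // "_42"
console.log(text.replace(unicodeNonWord, "")); // "привет_мир42"
```

The built-in class treats every Cyrillic letter as a non-word character, whereas the property-based class removes only the space and the exclamation mark.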

NOTE that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ – a character for the Roman numeral 12), plus some other symbols matched by \p{Other_Alphabetic} (\p{OAlpha}).
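
The distinction can be checked directly in JavaScript. This is a small sketch using U+216B (ROMAN NUMERAL TWELVE) as the sample character, with an illustrative variable name:

```javascript
// U+216B ROMAN NUMERAL TWELVE has General_Category Nl (letter number).
const romanTwelve = "\u216B";

const isLetter = /\p{L}/u.test(romanTwelve);              // false
const isLetterNumber = /\p{Nl}/u.test(romanTwelve);       // true
const isAlphabetic = /\p{Alphabetic}/u.test(romanTwelve); // true

console.log({ isLetter, isLetterNumber, isAlphabetic });
```

A class built on \p{L} alone would therefore treat this character as a non-word character unless \p{N} (or \p{Nl}) is also included.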

Other variations:

  • /[^\p{L}0-9_]/gu – a \W that is aware of Unicode letters only
  • /[^\p{L}\p{N}_]/gu – (PCRE \W style) a \W that is aware of Unicode letters and numbers only.
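
The following sketch shows where these variations diverge, using a non-ASCII digit and a decomposed accented letter as test inputs. The sample strings and variable names are assumptions chosen for illustration.

```javascript
const lettersOnly = /[^\p{L}0-9_]/gu;            // Unicode letters, ASCII digits
const lettersAndNumbers = /[^\p{L}\p{N}_]/gu;    // PCRE-style
const withMarks = /[^\p{L}\p{N}\p{M}\p{Pc}]/gu;  // also keeps combining marks

const arabicDigit = "a\u0663b"; // U+0663 ARABIC-INDIC DIGIT THREE between two letters
const decomposed = "e\u0301";   // "e" followed by U+0301 COMBINING ACUTE ACCENT

console.log(arabicDigit.replace(lettersOnly, "").length);       // 2 (digit removed)
console.log(arabicDigit.replace(lettersAndNumbers, "").length); // 3 (digit kept)
console.log(decomposed.replace(lettersAndNumbers, "").length);  // 1 (accent removed)
console.log(decomposed.replace(withMarks, "").length);          // 2 (accent kept)
```

If the input may contain decomposed text, a class without \p{M} will strip accents from their base letters, so normalising with String.prototype.normalize("NFC") first or including \p{M} may be worth considering.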

Note that Java’s (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.
