About Regular Expression
About Regular Expression Engine
The regular expression engine we use is Microsoft Regular Expression 1.0, which ships inside vbscript.dll, a dynamic link library found on most versions of Windows. Although it is very old, it can already do most of what we need.
- Fast Speed
Without complex but rarely used features, the engine is simple and fast enough to meet the strict speed demands of mass replacement. Across my numerous replacement jobs, this engine has always been fast.
- Simple Grammar
Second, its grammar is simple. Compared with Perl regexp or Java regexp, the user does not need to write the regexp delimiter "/" at all, which is very confusing when you also use "\" to escape characters.
- Good Compatibility
The regular expression container, vbscript.dll, is present on most versions of Windows, so this tool can run on most of them, even old Win 9x. Users of any Windows version, with or without the .Net framework or a Java runtime, can enjoy mass replacement.
- Light Weight
The user does not need to install any additional regexp engine. If we used a later regexp engine from .Net or Java, users would have to install a cumbersome .Net or Java runtime that is hundreds of times the size of this tool. (Even a full-featured Perl regexp engine is several times its size.)
- Excellent Reliability
Microsoft Regular Expression 1.0 is a stable version. It is more reliable than free third-party regexp engines, which are constantly being updated or bug-fixed.
There were only two versions before the .Net version, 1.0 and 5.5, and there are no notable differences between them, which suggests that version 1.0 was already very good.
In my work with various regexp engines, I have never met an error or a hang during replacement with this one, which does happen with other engines from time to time.
Limitation of Microsoft Regular Expression 1.0
- Lookbehind is not supported.
- There is no option to make . match everything including \n.
- ^ and $ match the beginning and end of the whole file, not of each line.
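The last two limitations can be illustrated with any engine that behaves the same way by default. The sketch below uses Python's re module purely as a stand-in: it is not the engine used by this tool, and unlike Microsoft Regular Expression 1.0 it offers switches that change this behavior. The sample text is made up for illustration.

```python
import re

text = "first line\nsecond line"

# "." stops at the line feed, so ".*" covers only the first line.
dot_match = re.match(r".*", text).group(0)

# Without a multiline switch, "^" anchors only at the start of the whole text,
# so "^second" finds nothing even though "second" begins a line.
anchored = re.search(r"^second", text)

# "$" anchors at the end of the whole text, so only the last "line" is found.
end_positions = [m.start() for m in re.finditer(r"line$", text)]

print(repr(dot_match))
print(anchored)
print(end_positions)
```

If you need to match at the start of an inner line with this kind of engine, one option is to include the preceding line break in the pattern itself, for example \nsecond.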
Essential Grammar
Key Words
Metacharacters
Key | Description | Example | Explanation |
^ | Start of line | ^On | The keyword "On" must be the first word of a line |
$ | End of line | | |
. | Anything except a line feed | | One character, similar to ? in Windows wildcards |
X* | The preceding expression occurs any number of times, or not at all | (.|\n)* | Matches nothing or anything (even paragraphs) of any length |
X+ | The preceding expression occurs once or more | (.|\n)+ | Anything (except nothing) of any length (>=1) |
? | Exists once or not | ( | Such as a,b,c or A,B,C |
(XX) | Makes a group of characters, so that another metacharacter can follow it or it can be referenced in the result | | |
[XX] | Makes a class of characters; matches any one of the characters inside | | |
(A|B) | Similar to [AB], but A and B can each be any number of characters, such as a word | | |
| | Either of the two sides can match | | |
[^X] | A class of all characters except those inside | | |
{a} | The preceding expression must be repeated a times | | |
{a,b} | The preceding expression must be repeated from a to b times | | |
? | The preceding expression occurs once or not at all | | |
? | Placed after another repetition metacharacter: match only as far as the nearest occurrence of the expression that follows | | |
- Some metacharacters have different meanings in different places, such as ^ and ?. Only in [^X] does ^ mean the reverse of [X], so there ^ is not an independent metacharacter; consider it part of [ rather than a prefix of X.
- .|\n can match any character: . matches any character except a newline (\n), so . or \n together cover everything. Thus (.|\n)+ matches anything of any length, which is very useful when you want to match across lines. (In other regular expression flavors, .|\n may not work. Some have a switch that makes . match any character including \n.)
- *, + and ? are all metacharacters of repetition count. Each can be written in {n} form, so you can regard them as short forms of frequently used counts: * equals {0,}, + equals {1,}, and ? equals {0,1}. Note the omitted number after the comma, which stands for ∞: inside {}, an omitted second number means there is no upper limit.
- When the alternatives are all single characters (including escaped characters), the group equals a class; for example, (a|b|c|d) equals [abcd].
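- To coordinate several words with |, it is safest to wrap them in parentheses, such as (word1|word2|word3) or (word1)|(word2)|(word3). In standard regular expression syntax | has the lowest precedence, so word1|word2|word3 on its own is read as word1 or word2 or word3, but without parentheses the alternation swallows everything written on either side of it, which is easy to misread once other expressions are added.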
- Everything inside [ and ] is read as single characters only; [ ] can never contain a word or a compound expression. For example, [you,me] equals [eoumy,], which is interpreted as (y|o|u|,|m|e).
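The notes above can be checked with a short sketch. It uses Python's re module only as a convenient stand-in for this tool's engine, so treat it as an illustration of the grammar rather than of Microsoft Regular Expression 1.0 itself. The helper whole_matches and the sample strings are hypothetical names introduced for this example.

```python
import re

def whole_matches(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "<p>one\ntwo</p>"

# "." alone cannot cross the line feed, so this pattern finds nothing.
single_line = re.search(r"<p>.+</p>", text)

# "(.|\n)+" accepts either any character or a line feed, so it spans lines.
multi_line = re.search(r"<p>(.|\n)+</p>", text)

# "*", "+" and "?" are short forms of {0,}, {1,} and {0,1}.
sample = "aaab"
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]
same_quantifier = all(
    whole_matches(short, sample) == whole_matches(full, sample)
    for short, full in pairs
)

# Single-character alternatives behave like a character class.
same_class = whole_matches(r"(a|b|c|d)", "abcdef") == whole_matches(r"[abcd]", "abcdef")

# A class only ever holds single characters, never a word.
class_chars = whole_matches(r"[you,me]", "you,me")

print(single_line, repr(multi_line.group(0)), same_quantifier, same_class, class_chars)
```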
Escaped Characters
Key | Abbr. of | Description | Example | Explanation |
\d | Digit | A digit | | Such as 1, 2, 3 |
\b | Boundary | A word boundary. (Some other regex flavors use \< and \> for the same purpose.) | \bsome\b | Only the word "some" matches; "something" and "handsome" do not |
\r | Return | Carriage return | | For classic Mac systems |
\n | Newline | Newline (line feed) | | For Unix and Linux systems |
\r\n | | Whole line break | | Mostly for Windows systems; also used in most cross-platform files |
\w | Word | One word character (a letter, digit, or underscore); without a + after it, it matches a single character only. To match a whole word, use \b\w+\b | | Such as a, b, c or A, B, C |
\s | Space | White space: a blank, invisible character | | Such as space or tab |
\t | Tab | A tab character | | |
\W ... | | An uppercase letter means the negative of the class denoted by the lowercase letter, that is, the reversed range | | |
- In regular expressions, all escaped characters are case sensitive. The general form of a class is a backslash followed by the lowercase first letter of the class name, while the same form with an uppercase letter means the reverse: everything except the class denoted by the lowercase one.
- Be careful: although it is written with two characters, a backslash followed by a Latin letter represents only one character. So the subject of any metacharacter that follows is not the Latin letter alone but the whole backslash-escaped class.
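A few of the escaped characters can be tried out as follows. This is a minimal sketch that uses Python's re module as a stand-in for this tool's engine; the sample strings are invented, and only plain ASCII text is used because engines differ in how they classify other characters.

```python
import re

text = "some something handsome some"

# \b marks a word boundary, so only the standalone word "some" matches.
whole_word_positions = [m.start() for m in re.finditer(r"\bsome\b", text)]

# \w is one word character; \b\w+\b matches whole words.
words = re.findall(r"\b\w+\b", "Order 66, please")

# The "+" applies to the whole escape \d, not to the letter "d".
digits = re.findall(r"\d+", "tel 555 0100")

# An uppercase letter reverses the class: \D is everything except digits.
non_digits = re.findall(r"\D+", "a1b22c")

print(whole_word_positions, words, digits, non_digits)
```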
About Windows wildcard
Most users are familiar with wildcards in Windows or MS Office, especially MS Word, where "*" matches any run of characters and "?" matches any single character.
But in regular expressions these two symbols have other meanings, so you have to use standard regular expression symbols or expressions to express your intention.
How to express "*" of Windows wildcards
Simply use ".+" to express any characters; if you also want to match nothing, use ".*".
How to express "?" of Windows wildcards
Simply use "." to express any single character. If you want to match only a Latin letter (no symbol or number), use "[A-Za-z]"; note that "\w" also matches digits and the underscore.
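The two translations above can be combined into a small converter. The sketch below is only an illustration written with Python's re module as a stand-in; the function name wildcard_to_regex and the sample file names are hypothetical, and it maps "*" to ".*" on the assumption that the wildcard may also match nothing.

```python
import re

def wildcard_to_regex(wildcard):
    """Translate a Windows-style wildcard into a regular expression string."""
    parts = []
    for ch in wildcard:
        if ch == "*":
            parts.append(".*")            # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")             # exactly one character
        else:
            parts.append(re.escape(ch))   # everything else is taken literally
    return "".join(parts)

pattern = wildcard_to_regex("report_??.t*")

# fullmatch requires the whole name to fit the pattern, like a wildcard does.
matches_two_digits = re.fullmatch(pattern, "report_01.txt") is not None
matches_one_digit = re.fullmatch(pattern, "report_1.txt") is not None

print(pattern, matches_two_digits, matches_one_digit)
```

Escaping the literal characters matters because symbols such as "." are ordinary in a wildcard but are metacharacters in a regular expression.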
For more references and examples, please visit our website.