PCRE regex syntax
PHP Manual

Unicode character properties

Since PHP 4.4.0 and 5.1.0, three additional escape sequences are available to match generic character types when UTF-8 mode is selected. They are:

\p{xx}
a character with the xx property
\P{xx}
a character without the xx property
\X
an extended Unicode sequence

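The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex (^) right after the opening brace. For example, \p{^Lu} is the same as \P{Lu}.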

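If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, the curly brackets in the escape sequence are optional; the following two forms have the same effect: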

\p{L}
\pL
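
The negation and single-letter rules can be checked with a minimal sketch. It assumes PHP 7+ (for the \u{...} string escape) and a PCRE build with Unicode property support; the sample string is made up for illustration.

```php
<?php
// Sample text (an assumption): "A", "b", a space, U+00C7 (C with cedilla, upper case), "d".
$text = "Ab \u{00C7}d";

// \P{Lu} and \p{^Lu} both match every character that is NOT an upper case letter.
preg_match_all('/\P{Lu}/u', $text, $viaUpperP);
preg_match_all('/\p{^Lu}/u', $text, $viaCaret);

// \p{L} and \pL both match any letter (Ll, Lm, Lo, Lt or Lu).
preg_match_all('/\p{L}/u', $text, $braced);
preg_match_all('/\pL/u', $text, $bare);

var_dump($viaUpperP[0] === $viaCaret[0]); // same matches: "b", " ", "d"
var_dump($braced[0] === $bare[0]);        // same matches: all four letters
```
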
Supported Unicode properties
Property  Matches                Notes
C         Other
Cc        Control
Cf        Format
Cn        Unassigned
Co        Private use
Cs        Surrogate
L         Letter                 Includes the following properties: Ll, Lm, Lo, Lt and Lu.
Ll        Lower case letter
Lm        Modifier letter
Lo        Other letter
Lt        Title case letter
Lu        Upper case letter
M         Mark
Mc        Spacing mark
Me        Enclosing mark
Mn        Non-spacing mark
N         Number
Nd        Decimal number
Nl        Letter number
No        Other number
P         Punctuation
Pc        Connector punctuation
Pd        Dash punctuation
Pe        Close punctuation
Pf        Final punctuation
Pi        Initial punctuation
Po        Other punctuation
Ps        Open punctuation
S         Symbol
Sc        Currency symbol
Sk        Modifier symbol
Sm        Mathematical symbol
So        Other symbol
Z         Separator
Zl        Line separator
Zp        Paragraph separator
Zs        Space separator
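
To illustrate how the category codes in the table might be used, here is a small sketch that counts characters by general category. The helper name count_category() and the sample string are hypothetical, and the code assumes PHP 7+ with Unicode property support in PCRE.

```php
<?php
// Hypothetical helper: count the characters of $text that have the given property.
function count_category(string $text, string $property): int
{
    // preg_match_all() returns the number of matches, or false on error.
    $count = preg_match_all('/\p{' . $property . '}/u', $text);
    return $count === false ? 0 : $count;
}

// Sample text (an assumption); U+20AC is the euro sign.
$text = "Total: 42 \u{20AC} (approx.)";

echo count_category($text, 'Nd'), "\n"; // decimal digits: "4" and "2"
echo count_category($text, 'Sc'), "\n"; // currency symbols: the euro sign
echo count_category($text, 'Lu'), "\n"; // upper case letters: "T"
echo count_category($text, 'P'), "\n";  // any punctuation: ":", "(", ".", ")"
```

Passing a single letter such as 'P' covers every two-letter category that starts with it (Pc, Pd, Pe, Pf, Pi, Po, Ps).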

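Extended properties such as "InMusicalSymbols" are not supported by PCRE. Script names such as "Greek" were likewise unsupported in older PCRE versions, but newer versions accept them, as the pcre.txt excerpt in the user comments below shows.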

Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to (?>\PM\pM*)

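That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.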
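
A short sketch of what this means in practice, assuming PHP 7+ (for the \u{...} string escape) and a made-up sample string: a base letter followed by a combining accent is two code points, but \X consumes them as one unit. Newer PCRE releases define \X in terms of extended grapheme clusters, so the equivalence with (?>\PM\pM*) is only checked here for this simple input.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT (category Mn), then a plain "a".
$text = "e\u{0301}a";

// One match per code point (the dot matches a whole character in UTF-8 mode).
$codePoints = preg_match_all('/./u', $text);

// One match per extended sequence: base character plus any following marks.
$sequences = preg_match_all('/\X/u', $text, $viaX);

// The equivalent atomic-group form given in the text above.
preg_match_all('/(?>\PM\pM*)/u', $text, $viaAtomic);

echo $codePoints, "\n";               // 3
echo $sequences, "\n";                // 2
var_dump($viaX[0] === $viaAtomic[0]); // bool(true)
```
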

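Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.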



User comments:

hayk at mail dot ru (03-Mar-2011 09:46)

There is a possibility to use \p{xx} and \P{xx} escape sequences with script names.

From http://www.pcre.org/pcre.txt

When PCRE is built with Unicode character property support, three
additional escape sequences that match characters with specific
properties are available. When not in UTF-8 mode, these sequences are of
course limited to testing characters whose codepoints are less than 256,
but they do work in this mode. The extra escape sequences are:

  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       an extended Unicode sequence

The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
character (including newline), and some special PCRE properties
(described in the next section). Other Perl properties such as
"InMusicalSymbols" are not currently supported by PCRE. Note that
\P{Any} does not match any characters, so always causes a match failure.

Sets of Unicode characters are defined as belonging to certain scripts.
A  character from one of these sets can be matched using a script name.
For example:

  \p{Greek}
  \P{Han}

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:

Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari,
Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
Lydian, Malayalam, Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki,
Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samaritan,
Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog,
Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai,
Tibetan, Tifinagh, Ugaritic, Vai, Yi.

Each character has exactly one Unicode general category property,
specified by a two-letter abbreviation. For compatibility with Perl,
negation can be specified by including a circumflex between the opening
brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the
general category properties that start with that letter. In this case,
in the absence of negation, the curly brackets in the escape sequence
are optional; these two examples have the same effect:

  \p{L}
  \pL
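
A quick way to try script names from PHP is sketched below. It assumes PHP 7+ and a PCRE version that supports script names, as described in the excerpt above; the sample strings are made up for illustration.

```php
<?php
// Sample strings (assumptions for illustration).
$greek = "\u{03B1}\u{03B2}\u{03B3}"; // Greek small letters alpha, beta, gamma
$mixed = "abc \u{6F22}\u{5B57}";     // Latin letters, a space, then two Han characters

var_dump(preg_match('/^\p{Greek}+$/u', $greek)); // int(1): every character is Greek
var_dump(preg_match('/\p{Han}/u', $mixed));      // int(1): a Han character is present
var_dump(preg_match('/^\P{Han}+$/u', $mixed));   // int(0): not every character is non-Han

// Extract only the run of Han characters.
preg_match('/\p{Han}+/u', $mixed, $m);
var_dump($m[0] === "\u{6F22}\u{5B57}");          // bool(true)
```
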

o_shes01 at uni-muenster dot de (22-Jan-2011 02:23)

For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter.
For example, there are three codepoints for the "LJ" digraph in Unicode:
  (*) uppercase "LJ": U+01C7
  (*) titlecase "Lj": U+01C8
  (*) lowercase "lj": U+01C9
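
The three code points can be checked against Lu, Lt and Ll with a short sketch. The helper name letter_case_category() is hypothetical, and the code assumes PHP 7.1+ with Unicode property support in PCRE.

```php
<?php
// The three code points for the LJ digraph listed in the comment above.
$forms = [
    'uppercase' => "\u{01C7}",
    'titlecase' => "\u{01C8}",
    'lowercase' => "\u{01C9}",
];

// Hypothetical helper: report which of Lu, Lt, Ll a single character belongs to.
function letter_case_category(string $char): ?string
{
    foreach (['Lu', 'Lt', 'Ll'] as $category) {
        if (preg_match('/^\p{' . $category . '}$/u', $char) === 1) {
            return $category;
        }
    }
    return null; // not a cased letter
}

foreach ($forms as $name => $char) {
    echo $name, ' => ', letter_case_category($char), "\n";
}
```
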

mercury at caucasus dot net (08-May-2010 06:32)

An excellent article explaining all these properties can be found here: http://www.regular-expressions.info/unicode.html

suit at rebell dot at (01-Mar-2010 01:13)

these properties are usually only available if PCRE is compiled with "--enable-unicode-properties"

if you want to match any word but want to provide a fallback, you can do something like that:

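<?php
// preg_match_all() returns false when the pattern fails to compile,
// e.g. when PCRE was built without Unicode property support.
// $str is the input string to search.
if (@preg_match_all('/\p{L}+/u', $str, $arr) === false) {
  // fallback goes here
  // for example just '/\w+/u' for a less accurate match
  preg_match_all('/\w+/u', $str, $arr);
}
?>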