PCRE正则语法
在线手册:中文 英文
PHP手册

递归模式

考虑匹配圆括号内字符串的问题, 允许无限嵌套括号. 如果不使用递归, 最好的方式是使用一个模式匹配固定深度的嵌套.它不能处理任意深度的嵌套. perl 5.6提供了一个实验性的功能允许正则表达式递归. 特殊项(?R)提供了递归的这种特殊用法. 这个PCRE模式解决了圆括号问题(假设PCRE_EXTENDED选项被设置了, 因此空白字符被忽略): \( ( (?>[^()]+) | (?R) )* \)

首先, 它匹配一个左括号. 然后它匹配任意数量的非括号字符序列或一个模式自身的递归匹配(比如, 一个正确的括号子串), 最终, 匹配一个右括号.

这个例子模式包含无限重复的嵌套, 因此使用了一次性子组匹配非括号字符, 这在模式应用到模式不匹配的字符串时非常重要. 比如, 当它应用到(aaaaaaaaaaaaaaaaaaaaaaaaaaaaa()时就会很快的产生”不匹配”结果. 然而, 如果不使用一次性子组, 这个匹配将会运行很长时间, 因为有很多途径让+和*重复限定分隔目标字符串, 并且在报告失败之前需要测试所有路径.

所有捕获子组最终被设置的捕获值都是从递归最外层子模式捕获的值. 如果上面的模式匹配(ab(cd)ef), 捕获子组最终被设置的值为”ef”, 即顶级得到的最后一个值. 如果增加了额外的括号, \( ( ( (?>[^()]+) | (?R) )* ) \), 捕获到的字符串就是顶层括号的匹配内容”ab(cd)ef”. 如果在模式中有超过15个捕获括号, PCRE在递归期间就会使用pcre_malloc分配额外的内存来存储数据, 随后通过pcre_free释放他们. 如果没有内存可被分配, 它就仅保存前15个捕获括号, 在递归内部无法给出内存不够用的错误.

从php4.3.3开始, (?1), (?2)等可以用于递归子组. 这同样可以用于命名子组: (?P>name)(?P&name)

如果递归子组语法在它提到的子组括号外部使用(无论是子组数字序号还是子组名称), 这个操作就相当于程序设计语言中的子程序. 前面一些有一个例子指出模式(sens|respons)e and \1ibility匹配”sense and responsibility”和”response and responsibility”, 但是不匹配”sense and responsibility”. 如果用模式(sens|respons)e and (?1)ibility替代, 它会像匹配那两个字符串一样匹配”sense and responsibility”. 这种引用方式意义是紧接着匹配引用的子模式.(译注: 后向引用只匹配引用的子组之前匹配的结果, 这里的递归语法引用是拿引用的子模式重新匹配)

目标字符串的最大长度是int型变量可以存储的最大正整数. 然而, PCRE使用递归处理子组和无限重复. 这就是说对于某些模式可用的栈空间可能会受目标字符串限制.


PCRE正则语法
在线手册:中文 英文
PHP手册
PHP手册 - N: 递归模式

用户评论:

Onyxagargaryll (03-Mar-2011 12:49)

Here's an approach to create a multidimensional array according to a string's delimiters, i.e. we want to analyze...

"some text (aaa(b(c1)(c2)d)e)(test) more text"

... as multidimensional layers.

<?php
$string
= "some text (aaa(b(c1)(c2)d)e)(test) more text";

/*
 * Analyses the string multidimensionally by its opening and closing delimiters
 */
function recursiveSplit($string, $layer) {
   
preg_match_all("/\((([^()]*|(?R))*)\)/",$string,$matches);
   
// iterate thru matches and continue recursive split
   
if (count($matches) > 1) {
        for (
$i = 0; $i < count($matches[1]); $i++) {
            if (
is_string($matches[1][$i])) {
                if (
strlen($matches[1][$i]) > 0) {
                    echo
"<pre>Layer ".$layer.":   ".$matches[1][$i]."</pre><br />";
                   
recursiveSplit($matches[1][$i], $layer + 1);
                }
            }
        }
    }
}

recursiveSplit($string, 0);

/*

Output:

Layer 0:   aaa(b(c1)(c2)d)e

Layer 1:   b(c1)(c2)d

Layer 2:   c1

Layer 2:   c2

Layer 0:   test
*/
?>

jonah at nucleussystems dot com (22-Dec-2010 02:18)

An unexpected behavior came up that introduced a very hard-to-track bug in some code I was working on.  It has to do with the preg_match_all PREG_OFFSET_CAPTURE flag.  When you capture the offset of a sub-match, it's offset is given _relative_ to it's parent.  For example, if you extract the value between < and > recursively in this string:

<this is a <string>>

You will get an array that looks like this:

Array
(
    [0] => Array
    (
        [0] => Array
        (
            [0] => <this is a <string>>
            [1] => 0
        )
        [1] => Array
        (
            [0] => this is a <string>
            [1] => 1
        )
    )
    [1] => Array
    (
        [0] => Array
        (
            [0] => <string>
            [1] => 0
        )
        [1] => Array
        (
            [0] => string
            [1] => 1
        )
    )
)

Notice that the offset in the last index is one, not the twelve we expected.  The best way to solve this problem is to run over the results with a recursive function, adding the parent's offset.

emanueledelgrande at email dot it (09-Jan-2010 11:47)

The recursion in regular expressions is the only way to allow the parsing of HTML code with nested tags of indefinite depth.
It seems it's not yet a spreaded practice; not so much contents are available on the web regarding regexp recursion, and until now no user contribute notes have been published on this manual page.
I made several tests with complex patterns to get tags with specific attributes or namespaces, studying the recursion of a subpattern only instead of the full pattern.
Here's an example that may power a fast LL parser with recursive descent (http://en.wikipedia.org/wiki/Recursive_descent_parser):

$pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/xsm";

The performances of a preg_match or preg_match_all function call over an avarage (x)HTML document are quite fast and may drive you to chose this way instead of classic DOM object methods, which have a lot of limits and are usually poor in performance with their workarounds, too.
I post a sample application in a brief function (easy to be turned into OOP), which returns an array of objects:

<?php
// test function:
function parse($html) {
   
// I have split the pattern in two lines not to have long lines alerts by the PHP.net form:
   
$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|".
   
"(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/sm";
   
preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);
   
$elements = array();
   
    foreach (
$matches[0] as $key => $match) {
       
$elements[] = (object)array(
           
'node' => $match[0],
           
'offset' => $match[1],
           
'tagname' => $matches[1][$key][0],
           
'attributes' => isset($matches[2][$key][0]) ? $matches[2][$key][0] : '',
           
'omittag' => ($matches[4][$key][1] > -1), // boolean
           
'inner_html' => isset($matches[6][$key][0]) ? $matches[6][$key][0] : ''
       
);
    }
    return
$elements;
}

// random html nodes as example:
$html = <<<EOD
<div id="airport">
    <div geo:position="1.234324,3.455546" class="index">
        <!-- comment test:
        <div class="index_top" />
        -->
        <div class="element decorator">
                <ul class="lister">
                    <li onclick="javascript:item.showAttribute('desc');">
                        <h3 class="outline">
                            <a href="http://php.net/manual/en/regexp.reference.recursive.php" onclick="openPopup()">Link</a>
                        </h3>
                        <div class="description">Sample description</div>
                    </li>
                </ul>
        </div>
        <div class="clean-line"></div>
    </div>
</div>
<div id="omittag_test" rel="rootChild" />
EOD;

// application:
$elements = parse($html);

if (
count($elements) > 0) {
    echo
"Elements found: <b>".count($elements)."</b><br />";
   
    foreach (
$elements as $element) {
        echo
"<p>Tpl node: <pre>".htmlentities($element->node)."</pre>
        Tagname: <tt>"
.$element->tagname."</tt><br />
        Attributes: <tt>"
.$element->attributes."</tt><br />
        Omittag: <tt>"
.($element->omittag ? 'true' : 'false')."</tt><br />
        Inner HTML: <pre>"
.htmlentities($element->inner_html)."</pre></p>";
    }
}
?>