The DOM Crawler


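Every time you make a request with the client, a Crawler instance is returned. It lets you traverse HTML or XML documents: select nodes, find links and forms, and retrieve attributes or contents.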

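Traversing

Like jQuery, the Crawler has methods to traverse the DOM of an HTML/XML document. For example, the following code finds all input[type=submit] elements, selects the last one on the page, and then selects its immediate parent element: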

    $newCrawler = $crawler->filter('input[type=submit]')
        ->last()
        ->parents()
        ->first()
    ;
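To try this chain outside a test case, the following is a minimal sketch that builds a Crawler directly from an HTML string. It assumes symfony/dom-crawler and symfony/css-selector are installed through Composer; the markup and the autoload path are made up for illustration. Newer releases of the component name the ancestor lookup ancestors() instead of parents(), so the sketch checks which method exists rather than assuming a version.

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

// Assumption: dependencies were installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

// Hypothetical markup standing in for a client response.
$html = <<<'HTML'
<html><body>
  <form id="search"><input type="submit" value="Search"></form>
  <form id="login"><p class="actions"><input type="submit" value="Log in"></p></form>
</body></html>
HTML;

$crawler = new Crawler($html);

// Last submit button on the page.
$last = $crawler->filter('input[type=submit]')->last();

// Use ancestors() where it exists, otherwise fall back to parents().
$ancestors = method_exists($last, 'ancestors') ? $last->ancestors() : $last->parents();

// The ancestor list starts with the nearest element, so first() is the direct parent.
$parent = $ancestors->first();

echo $last->attr('value'), "\n";   // Log in
echo $parent->attr('class'), "\n"; // actions
```

The heredoc keeps the sample markup readable; in a functional test the Crawler would come from the client request instead.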

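Many other methods are also available:

filter('h1.title'): Nodes that match the CSS selector.
filterXPath('h1'): Nodes that match the XPath expression.
eq(1): Node at the specified index (zero-based).
first(): First node.
last(): Last node.
siblings(): Sibling nodes.
nextAll(): All following siblings.
previousAll(): All preceding siblings.
parents(): Parent (ancestor) nodes, nearest first.
children(): Child nodes.
reduce($lambda): Nodes for which the callable does not return false.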
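A few of these methods can be sketched against a small list. This is an illustrative example only: the markup is invented, and it assumes symfony/dom-crawler and symfony/css-selector are available through a Composer autoloader.

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

// Assumption: dependencies were installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

// Hypothetical markup used only for this sketch.
$html = <<<'HTML'
<html><body>
  <ul id="menu">
    <li class="a">One</li>
    <li class="b">Two</li>
    <li class="c">Three</li>
    <li class="d">Four</li>
  </ul>
</body></html>
HTML;

$crawler = new Crawler($html);

// Small helper for this demo: collect the text of every node in a Crawler.
$texts = fn (Crawler $nodes): array => $nodes->each(fn (Crawler $node) => $node->text());

$items  = $crawler->filter('#menu > li');
$second = $items->eq(1); // zero-based index, so this is "Two"

echo $second->text(), "\n";                               // Two
echo implode(',', $texts($second->siblings())), "\n";     // One,Three,Four
echo implode(',', $texts($second->nextAll())), "\n";      // Three,Four
echo implode(',', $texts($second->previousAll())), "\n";  // One

// children() returns only element children, so whitespace text nodes are skipped.
echo count($crawler->filter('#menu')->children()), "\n";  // 4

// The same kind of lookup expressed as XPath.
echo $crawler->filterXPath('//li[@class="c"]')->text(), "\n"; // Three
```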

Since each of these methods returns a new Crawler instance, you can narrow down your node selection by chaining the method calls:

    $crawler
        ->filter('h1')
        ->reduce(function ($node, int $i): bool {
            if (!$node->attr('class')) {
                return false;
            }

            return true;
        })
        ->first()
    ;
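The chain above can be exercised on a concrete document. The following is a sketch under assumptions: the markup is invented, and symfony/dom-crawler plus symfony/css-selector are expected to be loadable through a Composer autoloader.

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

// Assumption: dependencies were installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

// Hypothetical markup: one heading without a class, two with one.
$html = <<<'HTML'
<html><body>
  <h1>Plain</h1>
  <h1 class="title">First titled</h1>
  <h1 class="title alt">Second titled</h1>
</body></html>
HTML;

$crawler = new Crawler($html);

// Keep only the headings that carry a non-empty class attribute.
// reduce() drops a node when the callable returns false.
$titled = $crawler
    ->filter('h1')
    ->reduce(function (Crawler $node, int $i): bool {
        return '' !== (string) $node->attr('class');
    })
;

echo count($titled), "\n";           // 2
echo $titled->first()->text(), "\n"; // First titled
```

Each step returns a new Crawler, so the original $crawler still holds the whole document after the chain runs.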

Tip

Use the count() function to get the number of nodes stored in a Crawler: count($crawler)
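One practical use of count() is guarding against empty selections before reading from them, since calling text() without a default on an empty Crawler throws an exception. A minimal sketch, with invented markup and assuming the Composer-installed symfony/dom-crawler and symfony/css-selector packages:

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

// Assumption: dependencies were installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

// Hypothetical markup used only for this sketch.
$crawler = new Crawler(
    '<html><body><h1 class="title">Hello</h1><p>One</p><p>Two</p></body></html>'
);

$paragraphs = $crawler->filter('p');
$missing    = $crawler->filter('h2'); // matches nothing

echo count($paragraphs), "\n"; // 2
echo count($missing), "\n";    // 0

// Only read from the selection when it actually contains a node.
$subtitle = count($missing) > 0 ? $missing->text() : 'n/a';

echo $subtitle, "\n"; // n/a
```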

Extracting Information

The Crawler can extract information from the nodes:

    // returns the attribute value for the first node
    $crawler->attr('class');

    // returns the node value for the first node
    $crawler->text();

    // returns the default text if the node does not exist
    $crawler->text('Default text content');

    // pass TRUE as the second argument of text() to remove all extra white spaces, including
    // the internal ones (e.g. "  foo\n  bar    baz \n " is returned as "foo bar baz")
    $crawler->text(null, true);

    // extracts an array of attributes for all nodes
    // (_text returns the node value)
    // returns an array for each element in the crawler,
    // each with the value and href
    $info = $crawler->extract(['_text', 'href']);

    // executes a lambda for each node and returns an array of results
    // (attr() yields null when the attribute is missing, hence the nullable return type)
    $data = $crawler->each(function ($node, int $i): ?string {
        return $node->attr('href');
    });
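The extraction calls above can be tried on a small set of links. This is a sketch rather than a reference: the markup is invented, and it assumes symfony/dom-crawler and symfony/css-selector are loadable through a Composer autoloader.

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

// Assumption: dependencies were installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

// Hypothetical markup; the first link contains extra internal whitespace.
$html = <<<'HTML'
<html><body>
  <a class="nav first" href="/home">  Home
     page </a>
  <a class="nav" href="/about">About</a>
</body></html>
HTML;

$links = (new Crawler($html))->filter('a');

// attr() and text() read from the first node of the selection.
echo $links->attr('class'), "\n";       // nav first
echo $links->text(null, true), "\n";    // Home page

// A default is returned instead of an exception when nothing matches.
echo $links->filter('span')->text('No span'), "\n"; // No span

// One row per node, each row holding the requested values in order.
$info = $links->extract(['_text', 'href']);
echo $info[1][0], ' -> ', $info[1][1], "\n"; // About -> /about

// With a single attribute the result is a flat list rather than a list of rows.
$hrefs = $links->extract(['href']);
echo implode(',', $hrefs), "\n"; // /home,/about

// each() passes the node and its position to the callable.
$data = $links->each(fn (Crawler $node, int $i): string => $i . ':' . $node->attr('href'));
echo implode(',', $data), "\n"; // 0:/home,1:/about
```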

This work, including the code samples, is licensed under a Creative Commons BY-SA 3.0 license.
