Sobak's crawler

Matcher Block

Matcher blocks are always used in the context of extracting something from the webpage contents. They are responsible for providing content for either blocks or objects (see Getting started for the definitions), and are therefore grouped into HTML matchers and list matchers.

HTML matchers always look for content exclusively within the scope of their current object. For example, if you define an object for posts on some website, HTML matchers assigned to the title field will only look for their pattern inside the single post that is currently being processed.

For usage examples, read the Configuration chapter or check the code sample on the main page.

CssSelectorHtmlMatcher

This is one of the most basic HTML matchers, and it works in a manner very similar to jQuery: just pass the CSS selector specifying the element you are interested in.

->addFieldDefinition('title', new CssSelectorHtmlMatcher('h1.entry-title a'))

CssSelectorListMatcher

The list variant of the CSS selector matcher, used to provide objects.

->addObjectDefinition('post', new CssSelectorListMatcher('article.hentry'), function (ObjectConfiguration $object) {
    // ...
})

RegexHtmlMatcher

Matches HTML (or plain text without tags, depending on the selection being made) using a regular expression. It always looks for the named group called result.

->addFieldDefinition('age', new RegexHtmlMatcher('#I am (?P<result>\d+) years? old#m'))
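
To make the role of the named group more concrete, here is a minimal sketch in plain PHP, assuming a pattern of the same shape as the one above. It uses only the built-in preg_match function and is not the library's own code; the sample markup and the variable names are made up for illustration.

```php
<?php
// Plain PHP illustration of a named group; this is not the library's internal code.
// The sample markup below is hypothetical.
$pattern = '#I am (?P<result>\d+) years? old#m';
$html = "<p>Hello!</p>\n<p>I am 27 years old.</p>";

$age = null;
if (preg_match($pattern, $html, $matches) === 1) {
    // Only the part captured by the group named "result" is kept;
    // the surrounding text ("I am", "years old") just anchors the match.
    $age = $matches['result'];
}

echo $age, PHP_EOL; // prints 27
```

A pattern without a group named result would leave nothing to read in this sketch, which is why the group name matters more than the rest of the expression.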

XpathHtmlMatcher

Matches HTML (or plain text) based on an XPath expression.
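
For readers less familiar with XPath, the sketch below shows what such an expression selects, using PHP's bundled DOM extension (DOMDocument and DOMXPath). It only illustrates XPath itself and does not show how the matcher is implemented; the markup and variable names are assumptions made for the example.

```php
<?php
// Plain PHP illustration of XPath using the bundled DOM extension;
// this is not the library's own code, and the markup is hypothetical.
libxml_use_internal_errors(true); // keep HTML parser warnings out of the output

$html = '<html><body><div class="post">'
      . '<h1 class="entry-title"><a href="/hello">Hello world</a></h1>'
      . '</div></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Select the link inside the heading and read its text content.
$nodes = $xpath->query('//h1[@class="entry-title"]/a');
$title = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

// An expression can also target an attribute instead of an element.
$href = $xpath->evaluate('string(//h1[@class="entry-title"]/a/@href)');

echo $title, PHP_EOL; // prints Hello world
echo $href, PHP_EOL;  // prints /hello
```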

XpathListMatcher

Matches a list of elements for objects using an XPath expression.
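
The relationship between a list of elements and the per-object scope described earlier can be sketched in plain PHP as follows. This is an illustration built on DOMXPath under assumed markup, not the library's implementation: one expression selects every post, and a second, relative expression is evaluated inside each post only.

```php
<?php
// Plain PHP illustration of "list of elements, then scoped lookup";
// this is not the library's own code, and the markup is hypothetical.
libxml_use_internal_errors(true); // keep HTML parser warnings out of the output

$html = '<div class="post"><h2>First</h2></div>'
      . '<div class="post"><h2>Second</h2></div>'
      . '<h2>Sidebar</h2>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$titles = [];
// The list expression yields one node per object (here: per post).
foreach ($xpath->query('//div[@class="post"]') as $post) {
    // The leading "." keeps the lookup inside the current $post node,
    // so the "Sidebar" heading outside the posts is never considered.
    $titles[] = $xpath->evaluate('string(.//h2)', $post);
}

print_r($titles); // contains "First" and "Second" only
```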