Sobak's crawler

Configuration

As mentioned previously, the configuration for scrawler (the default command-line entry point) should live in a file, and that file should return an object of the Configuration class.

Make sure to read the getting started chapter first so that you understand Scrawler's basic concepts.

Basic options

Each basic option takes a single value, which is provided using a classic PHP setter method.

setBaseUrl()

Setting this value is mandatory. It should be an absolute URL Scrawler will start crawling from.

setOperationName()

Setting this value is mandatory. This can be any string that uniquely identifies your configuration set. However, it will be used to name the output directory, will appear in the logs, etc., so it makes sense to set it to something meaningful.

setMaxCrawledUrls()

The maxCrawledUrls option specifies the maximum number of unique URLs Scrawler will process. Omitting this option or setting it to 0 means that all the available URLs (given by registered URL list providers) will be fetched and parsed.

setRobotsParser()

You can set a single robots parser block, which will be used to check all found URLs against the rules set by the website owner. For more details, please consult the separate chapter of the documentation.
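Taken together, the basic options might be used as in the sketch below. The use statement assumes the class lives in the Sobak\Scrawler\Configuration namespace, and the limit of 50 URLs is an arbitrary illustrative value. The setters are called one by one rather than chained, so the sketch does not rely on what they return. setRobotsParser() is left out because this chapter does not name a concrete robots parser implementation.

```php
<?php

// Assumed namespace; verify it against your installed Scrawler version.
use Sobak\Scrawler\Configuration\Configuration;

$configuration = new Configuration();

// Mandatory: the absolute URL crawling starts from.
$configuration->setBaseUrl('http://sobak.pl');

// Mandatory: also used to name the output directory and shown in the logs.
$configuration->setOperationName('Sample config');

// Optional: stop after 50 unique URLs (0 or omitted means no limit).
$configuration->setMaxCrawledUrls(50);

// The configuration file must return the Configuration object.
return $configuration;
```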

Default configuration

To apply a set of recommended configuration options, pass your Configuration object to the DefaultConfigurationProvider class, which modifies it by reference. Be sure to review what the default options are.

$scrawler = new Configuration();
$scrawler = (new DefaultConfigurationProvider())->setDefaultConfiguration($scrawler);

$scrawler
    ->setOperationName('Test config')
    // ...
;

Blocks

The idea of blocks was described in the previous chapter, so we will skip the theory here. In the context of configuration, you need to know that you can add any number of implementations for any block type to configure your operation.

Adding blocks (either top-level or for objects described below) usually follows the same API pattern:

addBlockName(new SomeBlockImplementation(...));

For the top-level configuration (outside of blocks) you can use the following block types:

For example:

return (new Configuration())
    ->setOperationName('Sample config')
    ->setBaseUrl('http://sobak.pl')
    ->addUrlListProvider(new EmptyUrlListProvider())
;

Objects

Objects are added using addObjectDefinition on the Configuration object, and their own configuration is provided using a closure. Each object must consist of at least one field, which is added using the addFieldDefinition method of the ObjectConfiguration class (available as the argument of the closure).

return (new Configuration())
    ->setOperationName('Sample config')
    ->setBaseUrl('http://sobak.pl')
    ->addUrlListProvider(new EmptyUrlListProvider())
    ->addObjectDefinition('post', new CssSelectorListMatcher('article.hentry'), function (ObjectConfiguration $object) {
        $object
            ->addFieldDefinition('date', new CssSelectorHtmlMatcher('time.entry-date'))
        ;
    })
;

As you can see above, aside from the object name (which can be any unique string) and the closure, you need to provide one more element when describing objects: a matcher instance. Matchers (described in the Blocks part of the documentation) are responsible for picking out information from the current page. There are two main types of matchers: those returning a single element and those returning a list of elements. For objects you need to use the latter, even if you expect just one match for the object over the course of the entire operation (just like the website header/description example given in the previous chapter). However, if you expect that a given object will be the same on every page, you can mark it using the once() method so that Scrawler will not spend time mapping and writing it over and over:

return (new Configuration())
    ->setOperationName('Sample config')
    ->setBaseUrl('http://sobak.pl')
    ->addUrlListProvider(new EmptyUrlListProvider())
    ->addObjectDefinition('post', new CssSelectorListMatcher('article.hentry'), function (ObjectConfiguration $object) {
        $object
            ->addFieldDefinition('date', new CssSelectorHtmlMatcher('time.entry-date'))
        ;
    })
    ->addObjectDefinition('meta', new CssSelectorListMatcher('body'), function (ObjectConfiguration $object) {
        $object
            ->addFieldDefinition('description', new CssSelectorHtmlMatcher('h2.site-description'))
            ->once()
        ;
    })
;

Object fields

As you can see, the configuration for an object field is quite similar to the configuration for the object itself. The main difference is that the matcher used there needs to return a single element representing that field.

All matching criteria for object fields are considered in the scope of the object itself. In the example above, time.entry-date will only be matched inside the specific object being parsed at the moment, so you don't have to repeat the object's matching criteria (e.g. by using article.hentry time.entry-date).
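The scoping idea can be illustrated independently of Scrawler with PHP's built-in DOM extension. This is not Scrawler's implementation, only a sketch of the concept: the markup and the XPath expressions are made up for the illustration, with an outer query standing in for the object matcher and a relative inner query standing in for the field matcher.

```php
<?php

// Illustration only: plain PHP DOM, not Scrawler's internals.
$html = <<<'HTML'
<html><body>
<time class="entry-date">site-wide</time>
<article class="hentry"><time class="entry-date">2019-01-01</time></article>
<article class="hentry"><time class="entry-date">2019-02-01</time></article>
</body></html>
HTML;

// Suppress libxml warnings about HTML5 tags such as <article> and <time>.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$dates = [];

// Outer query: one node per "object" (every article.hentry on the page).
foreach ($xpath->query('//article[@class="hentry"]') as $article) {
    // Inner query is relative (leading dot) and receives the article as its
    // context node, so it cannot see the <time> element outside the article.
    $time = $xpath->query('.//time[@class="entry-date"]', $article)->item(0);
    $dates[] = $time->textContent;
}

// $dates holds one date per article; the site-wide <time> is not included.
print_r($dates);
```

Because the inner query runs against each article in turn, the selector for the field stays short and does not need to repeat the selector for the object.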

Entity mapping

Next, you have to provide the entity class that results will be mapped to. Entity classes are defined by you, serve mostly as simple Data Transfer Objects, and are discussed separately in the next chapter.

return (new Configuration())
    ->setOperationName('Sample config')
    ->setBaseUrl('http://sobak.pl')
    ->addUrlListProvider(new EmptyUrlListProvider())
    ->addObjectDefinition('post', new CssSelectorListMatcher('article.hentry'), function (ObjectConfiguration $object) {
        $object
            ->addFieldDefinition('date', new CssSelectorHtmlMatcher('time.entry-date'))
            // obviously you can define more fields...
            ->addEntityMapping(PostEntity::class)
        ;
    })
    ->addObjectDefinition('meta', new CssSelectorListMatcher('body'), function (ObjectConfiguration $object) {
        $object
            ->addFieldDefinition('description', new CssSelectorHtmlMatcher('h2.site-description'))
            ->addEntityMapping(MetaEntity::class)
            ->once()
        ;
    })
;

Result writers

The last piece of the puzzle is providing the class(es) responsible for saving the results. Since result writers are de facto blocks, you add them using the addResultWriter() method. As the name implies, you can register multiple result writers: you can write different entities each in their own way, or even write a single entity to multiple destinations.

There is a separate chapter describing the available result writers and their configuration, but let's finish our example with something simple:
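As a sketch of the "single entity, multiple destinations" case, the closure below registers two writers for the same entity class. It only uses classes that appear in this chapter, but the namespaces in the use statements are assumptions, $configurePost is a hypothetical variable name, the 'posts' filename is an arbitrary value, and PostEntity stands for your own entity class.

```php
<?php

// Assumed namespaces; verify them against your installed Scrawler version.
use Sobak\Scrawler\Block\Matcher\CssSelectorHtmlMatcher;
use Sobak\Scrawler\Block\ResultWriter\DumpResultWriter;
use Sobak\Scrawler\Block\ResultWriter\FilenameProvider\LiteralFilenameProvider;
use Sobak\Scrawler\Block\ResultWriter\JsonFileResultWriter;
use Sobak\Scrawler\Configuration\ObjectConfiguration;

// PostEntity is your own entity class; import it from wherever you defined it.

// Hypothetical closure, meant to be passed as the last argument
// of addObjectDefinition() for the 'post' object.
$configurePost = function (ObjectConfiguration $object) {
    $object
        ->addFieldDefinition('date', new CssSelectorHtmlMatcher('time.entry-date'))
        ->addEntityMapping(PostEntity::class)
        // First destination: dump every mapped PostEntity.
        ->addResultWriter(PostEntity::class, new DumpResultWriter())
        // Second destination for the same entity: a JSON file.
        ->addResultWriter(PostEntity::class, new JsonFileResultWriter([
            'filename' => new LiteralFilenameProvider(['filename' => 'posts']),
        ]))
    ;
};
```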

return (new Configuration())
    ->setOperationName('Sample config')
    ->setBaseUrl('http://sobak.pl')
    ->addUrlListProvider(new EmptyUrlListProvider())
    ->addObjectDefinition('post', new CssSelectorListMatcher('article.hentry'), function (ObjectConfiguration $object) {
        $object
            ->addFieldDefinition('date', new CssSelectorHtmlMatcher('time.entry-date'))
            ->addEntityMapping(PostEntity::class)
            ->addResultWriter(PostEntity::class, new DumpResultWriter())
        ;
    })
    ->addObjectDefinition('meta', new CssSelectorListMatcher('body'), function (ObjectConfiguration $object) {
        $object
            ->addFieldDefinition('description', new CssSelectorHtmlMatcher('h2.site-description'))
            ->addEntityMapping(MetaEntity::class)
            ->addResultWriter(MetaEntity::class, new JsonFileResultWriter([
                'filename' => new LiteralFilenameProvider(['filename' => 'meta']),
            ]))
            ->once()
        ;
    })
;

That should finally give us something. There is a lot of information in this chapter, but as you can see, you can dump the dates of all posts from my blog homepage in around ten lines of code.

You should of course continue reading this documentation to do something more useful than dumping dates on the screen…