Sobak's crawler

Cookbook: writing parts of an object to different storages

Examples shown in the documentation so far present somehow simple flow regarding the results where object we define is first mapped over the user-provided entity and then written down to the storage of choice, like a database or set of JSON files. Inside the entity classes we define every field of an object, then create getters and setters for each of them and that's more or less it.

However, this definitely isn't all Scrawler can do. We can notice that result writers are not set (so exchanged one for another) but added to the stack. This is done to allow defining multiple objects, each having their own result writer but it is also done so that same object (and thus its resulting entity) can be written multiple times using different logic (contained in Blocks).

For this study case we'll scrap posts from my blog but instead of using single writer for an object wel'll save posts' metadata into single CSV file while the content of each post will be kept inside individual HTML file.

<?php

declare(strict_types=1);

use App\PostContentEntity;
use App\PostMetaEntity;
use Sobak\Scrawler\Block\ClientConfigurationProvider\ScrawlerUserAgentProvider;
use Sobak\Scrawler\Block\Matcher\CssSelectorHtmlMatcher;
use Sobak\Scrawler\Block\Matcher\CssSelectorListMatcher;
use Sobak\Scrawler\Block\ResultWriter\FilenameProvider\IncrementalFilenameProvider;
use Sobak\Scrawler\Block\ResultWriter\FilenameProvider\LiteralFilenameProvider;
use Sobak\Scrawler\Block\ResultWriter\CsvFileResultWriter;
use Sobak\Scrawler\Block\ResultWriter\TemplateFileResultWriter;
use Sobak\Scrawler\Block\UrlListProvider\ArgumentAdvancerUrlListProvider;
use Sobak\Scrawler\Configuration\Configuration;
use Sobak\Scrawler\Configuration\DefaultConfigurationProvider;
use Sobak\Scrawler\Configuration\ObjectConfiguration;

require 'vendor/autoload.php';

$scrawler = new Configuration();
$scrawler = (new DefaultConfigurationProvider())->setDefaultConfiguration($scrawler);

$scrawler
    ->setOperationName('Blog posts')
    ->setBaseUrl('http://sobak.pl')
    ->addClientConfigurationProvider(new ScrawlerUserAgentProvider('Scrawler Test', 'http://sobak.pl'))
    ->addUrlListProvider(new ArgumentAdvancerUrlListProvider('/page/%u', 2))
    ->addObjectDefinition('post', new CssSelectorListMatcher('article.hentry'), function (ObjectConfiguration $object) {
        $object
            ->addFieldDefinition('date', new CssSelectorHtmlMatcher('time.entry-date'))
            ->addFieldDefinition('content', new CssSelectorHtmlMatcher('div.entry-content'))
            ->addFieldDefinition('slug', new CssSelectorHtmlMatcher('h1.entry-title a[href]'))
            ->addFieldDefinition('title', new CssSelectorHtmlMatcher('h1.entry-title a'))
            ->addEntityMapping(PostContentEntity::class)
            ->addEntityMapping(PostMetaEntity::class)
            ->addResultWriter(PostMetaEntity::class, new CsvFileResultWriter([
                'filename' => new LiteralFilenameProvider(['filename' => 'posts']),
                'insert_bom' => true,
            ]))
            ->addResultWriter(PostContentEntity::class, new TemplateFileResultWriter([
                'directory' => 'posts/',
                'extension' => 'html',
                'filename' => new IncrementalFilenameProvider(),
                'template' => '{{content}}'
            ]))
        ;
    })
;

return $scrawler;

Contrary to the earlier examples we define two entities to represent the post, each consisting of different fields. Then we will assign different type of result writer to each entity to achieve the desired effect.

Result writer will attempt to write every field it will find a getter for. Therefore in order to only write the subset of an object data we need the separate entity.

The entity that will represent post metadata will look like so:

<?php

declare(strict_types=1);

namespace App;

class PostMetaEntity
{
    protected $date;

    protected $slug;

    protected $title;

    public function getDate()
    {
        return $this->date;
    }

    public function setDate($date)
    {
        $this->date = $date;
    }

    public function getSlug()
    {
        return $this->slug;
    }

    public function setSlug($slug)
    {
        $this->slug = $slug;
    }

    public function getTitle()
    {
        return $this->title;
    }

    public function setTitle($title)
    {
        $this->title = $title;
    }
}

Notice the lack of the content field. That way it will not be written into the CSV file (where it would perhaps decrease its readability and be hard to utilize in any way). Instead we will create the dedicated entity to hold the content, that will be later handled by the TemplateFileResultWriter putting it into HTML files.

<?php

declare(strict_types=1);

namespace App;

class PostContentEntity
{
    protected $content;

    public function getContent()
    {
        return $this->content;
    }

    public function setContent($content)
    {
        $this->content = $content;
    }
}

Being strict, we could use single entity for this case because TemplateFileResultWriter accepts placeholder of fields it puts into the fileā€¦ Nonetheless, treat this writeup as a general advice on how to write parts of results to many different places.