Sobak's crawler

Entities

When Scrawler parses a website, its ultimate goal is to map the scraped information to entities and then to write it out. This chapter covers entities: how they are mapped and some general tips on writing them.

Entities are plain PHP objects; they do not have to implement any interface or extend any base class. Their primary responsibility is to store data fetched by Scrawler so that it can then be used by result writers (see the Getting started chapter for more details on the flow). That is why entities typically do not contain much logic. Mostly they define properties (private or protected is recommended) along with getters and setters, so they act as Data Transfer Objects.

<?php

class Post
{
    protected $content;

    protected $date;

    protected $title;

    public function getContent()
    {
        return $this->content;
    }

    public function setContent($content)
    {
        $this->content = $content;
    }

    public function getDate()
    {
        return $this->date;
    }

    public function setDate($date)
    {
        $this->date = $date;
    }

    public function getTitle()
    {
        return $this->title;
    }

    public function setTitle($title)
    {
        $this->title = $title;
    }
}

You can use PHP argument and return type declarations to enforce type correctness.
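As a minimal sketch of what such declarations look like, the trimmed-down entity below types a single field. The nullable return type on the getter is a choice made for this example, so that reading the title before it has been set does not fail.

```php
<?php

declare(strict_types=1);

// Trimmed-down variant of the Post entity, for illustration only.
class Post
{
    protected $title;

    // Nullable return type: the title is null until the setter has run.
    public function getTitle(): ?string
    {
        return $this->title;
    }

    // Rejects anything that is not a string, including null.
    public function setTitle(string $title): void
    {
        $this->title = $title;
    }
}

$post = new Post();
$post->setTitle('Hello world');

try {
    $post->setTitle(null);
} catch (TypeError $e) {
    // The invalid value never reaches the property.
    echo 'Rejected: title must be a string', PHP_EOL;
}
```

With a typed setter, a value of the wrong type raises a TypeError at the moment it is assigned rather than surfacing later in a result writer.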

When writing data to an entity, Scrawler uses your setters to hydrate the object and set the values of the underlying properties. This process is called entity mapping. It means that if you define a property without a setter, its value will not be set. Instead, the field will be silently ignored and the value will be null even if you define a getter. This is by design, as Scrawler does not require you to map every field of the object definition to an entity field. Note that this is true even if you make the property public.
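The behaviour described above can be illustrated with a small sketch. This is not Scrawler's actual implementation; the `hydrate()` helper is hypothetical and only mimics the setter-only rule so you can see its effect on a public property that has no setter.

```php
<?php

class Post
{
    protected $title;

    // Public, but without a setter: a setter-based mapper never touches it.
    public $date;

    public function getTitle()
    {
        return $this->title;
    }

    public function setTitle($title)
    {
        $this->title = $title;
    }

    public function getDate()
    {
        return $this->date;
    }
}

// Hypothetical helper, for illustration only (not part of Scrawler).
function hydrate(object $entity, array $values): object
{
    foreach ($values as $field => $value) {
        $setter = 'set' . ucfirst($field);

        // Fields without a publicly callable setter are skipped silently.
        if (is_callable([$entity, $setter])) {
            $entity->{$setter}($value);
        }
    }

    return $entity;
}

$post = hydrate(new Post(), ['title' => 'Hello', 'date' => '2019-01-01']);

var_dump($post->getTitle()); // string(5) "Hello"
var_dump($post->getDate());  // NULL
```

Adding a `setDate()` method to the class is all it takes for the date to be mapped as well.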

A similar process happens inside result writers when they read from entities. If a property has no getter, its value will not be taken into account. In this case, however, that may cause a failure if your result writer configuration makes the field required (e.g. when using the database result writer with a non-nullable column).
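The reading side can be sketched in the same spirit. The `readFields()` helper below is hypothetical and is not how any particular result writer is implemented; it only shows why a field without a getter never reaches the output.

```php
<?php

class Post
{
    protected $title = 'Hello';

    protected $date = '2019-01-01';

    public function getTitle()
    {
        return $this->title;
    }

    // No getDate(): the date is invisible to a getter-based reader.
}

// Hypothetical helper, for illustration only (not part of Scrawler).
function readFields(object $entity, array $fields): array
{
    $row = [];

    foreach ($fields as $field) {
        $getter = 'get' . ucfirst($field);

        // Only fields exposed through a public getter are collected.
        if (is_callable([$entity, $getter])) {
            $row[$field] = $entity->{$getter}();
        }
    }

    return $row;
}

$row = readFields(new Post(), ['title', 'date']);

// $row holds only the "title" key; a writer that requires "date"
// would have nothing to store for it.
var_dump(array_keys($row));
```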

Entity mapping

Your entities are written and read by Scrawler internally. This means that the whole entity mapping logic has been outlined above and you have no direct influence over it (e.g. there is no toArray() method or similar). You can, however, use getters and setters to achieve custom mapping. For example, if the underlying result writer expects the post date to be a PHP DateTime object instead of the plain string scraped from the website, you can call the DateTime constructor in your setter and pass it the argument.
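A minimal sketch of such a setter might look as follows. It assumes the scraped date string is in a format the DateTime constructor can parse; the sample value is made up for the example.

```php
<?php

class Post
{
    protected $date;

    public function getDate(): ?DateTime
    {
        return $this->date;
    }

    public function setDate(string $date): void
    {
        // The constructor throws an exception if the string cannot be parsed.
        $this->date = new DateTime($date);
    }
}

$post = new Post();
$post->setDate('2019-03-15 10:30:00');

echo $post->getDate()->format('Y-m-d'), PHP_EOL; // 2019-03-15
```

If the website uses a format the constructor does not understand, `DateTime::createFromFormat()` lets you state the expected format explicitly.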

You can even create virtual fields for your entity, which are not read from the website directly but computed from other available information. Following the example above, if you would like to store a slug for a post and have it generated from the title, you can add one more property with a getter.

protected $slug;

// ...

public function getSlug()
{
    return str_replace(' ', '-', $this->title);
}

A setter does not really make sense in this case, as we assume you will not provide a definition for the slug field in your configuration. The result writer will still see the slug property in your entity and attempt to use it.

You can combine the trick shown above with adding multiple entity mappings for a single Scrawler object (as defined in the Getting started chapter) to create a powerful configuration and have the same scraped data written to multiple destinations with varying fields. Take a look at the Cookbook section for real-life examples.

Finally, your entities may sometimes need additional mapping metadata. This is the case e.g. when using the database result writer. The Doctrine library, which is used under the hood, expects each property to be described by an annotation so that the proper database column can be used for it. Your post entity may then look like this:

<?php

declare(strict_types=1);

namespace App;

use Doctrine\ORM\Mapping as ORM;
use Sobak\Scrawler\Support\Utils;

/**
 * @ORM\Entity()
 * @ORM\Table(name="posts")
 */
class Post
{
    /**
     * @ORM\Column(type="text")
     */
    protected $content;

    /**
     * @ORM\Column(type="string")
     */
    protected $date;

    /**
     * @ORM\Id()
     * @ORM\Column(type="string")
     */
    protected $slug;

    /**
     * @ORM\Column(type="string")
     */
    protected $title;

    public function getContent(): string
    {
        return $this->content;
    }

    public function setContent(string $content): void
    {
        $this->content = $content;
    }

    public function getDate(): string
    {
        return $this->date;
    }

    public function setDate(string $date): void
    {
        $this->date = $date;
    }

    public function getSlug(): string
    {
        return $this->slug;
    }

    public function getTitle(): string
    {
        return $this->title;
    }

    public function setTitle(string $title): void
    {
        $this->slug = Utils::slugify($title);
        $this->title = $title;
    }
}