Sobak's crawler

ResultWriter Block

Result writers are responsible for the final step Scrawler performs: persisting scraped data. Since their underlying logic can be quite complex, they all accept a configuration array. Basic configuration handling is one of the things you get by extending the AbstractResultWriter class.

Result writers are set per entity, which means that each scraped entity type can be persisted in its own way.

Meta: File result writers

This is just a meta section, not the block itself. There are a couple of result writer implementations which ultimately write a file to your disk. To encapsulate common logic they extend the FileResultWriter abstract class, so they share some behavior.

First, they all accept a directory configuration option, which further specifies the results destination relative to the output directory of your operation.

Second, they require a filename provider object under the filename key. Read more about filename providers in the dedicated documentation section and choose the one that suits your needs.

CsvFileResultWriter

This result writer writes each entity as a row of a CSV file. Aside from the generic options for file result writers mentioned above, it also accepts a headers array. Passing it creates one header row at the top of the file and orders all entity properties consistently with the headers; otherwise properties are ordered alphabetically.

This writer depends on League CSV; install it first with composer require league/csv "^9.2".

->addResultWriter(PostEntity::class, new CsvFileResultWriter([
    'filename' => new LiteralFilenameProvider(['filename' => 'posts']),
    'headers' => [
        'date' => 'Published on',
        'title' => 'Title',
        'slug' => 'Slug',
    ],
]))

Since this result writer creates one file for all of its entities, you will probably want to use the LiteralFilenameProvider to set the filename.

CSV generation can be customized by setting the delimiter, enclosure and escape_char options. See the PHP manual entry for fgetcsv for their meaning.

For Microsoft Excel on Windows to read UTF-8 characters properly, the file needs a BOM character at its beginning. You can tell Scrawler to insert it for you by setting the insert_bom config option to true (false by default).
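
As a rough illustration of the options described above, the following sketch configures a semicolon-delimited file that starts with a BOM. The option names come from this section; the values are only examples and are not meant to reflect the defaults.

```php
->addResultWriter(PostEntity::class, new CsvFileResultWriter([
    'filename' => new LiteralFilenameProvider(['filename' => 'posts']),
    // Example values only; pick the ones your spreadsheet software expects.
    'delimiter' => ';',
    'enclosure' => '"',
    'escape_char' => '\\',
    // Helps Microsoft Excel on Windows detect UTF-8.
    'insert_bom' => true,
]))
```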

DatabaseResultWriter

Saves results into almost any of the databases supported by PDO.

This writer depends on Doctrine; install it first with composer require doctrine/orm "^2.6".

To configure the database result writer, pass the connection parameters under the connection key, as described in the Doctrine documentation. Please note that URL connection strings are not supported; you need to use an array.

You can pass an additional parameter, simple_annotations, which when set to true parses Doctrine annotations without importing their namespace (false by default).

Sample configuration for the database result writer might then look like so:

// MySQL
->addResultWriter(PostEntity::class, new DatabaseResultWriter([
    'connection' => [
        'driver'   => 'pdo_mysql',
        'user'     => 'root',
        'password' => '',
        'dbname'   => 'scrawler',
    ],
]))

// SQLite
// Note that relative paths are resolved against the directory Scrawler's CLI
// was called from, so it's more reliable to stick to absolute paths.
->addResultWriter(PostEntity::class, new DatabaseResultWriter([
    'connection' => [
        'driver'   => 'pdo_sqlite',
        'path'     => '/var/scrawler/results.sqlite',
    ],
    'simple_annotations' => true,
]))

To use this writer you need to describe all entities used by it with Doctrine annotation mappings. Relationships are not supported at the time of writing.
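
A minimal sketch of such a mapping, using standard Doctrine ORM 2.x annotations, might look like the one below. The property names are borrowed from the CSV example above, while the table name, the column types and the surrogate id column are assumptions made for illustration. Whatever base class or interface Scrawler expects from an entity is deliberately left out here.

```php
<?php

use Doctrine\ORM\Mapping as ORM;

/**
 * Illustrative mapping only; the table name and column types are assumptions.
 *
 * @ORM\Entity
 * @ORM\Table(name="posts")
 */
class PostEntity
{
    /**
     * Surrogate primary key generated by the database.
     *
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    protected $id;

    /** @ORM\Column(type="string") */
    protected $date;

    /** @ORM\Column(type="string") */
    protected $title;

    /** @ORM\Column(type="string") */
    protected $slug;
}
```

With simple_annotations set to true, the same annotations would be written without the imported prefix, for example @Entity and @Column(type="string").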

If the database doesn't exist, it will be created, provided that the user specified in the connection has permission to create a new database. When running Scrawler, all tables described by the entities assigned to this result writer will be dropped and created from scratch with fresh data.

DumpResultWriter

This is probably the simplest result writer available. All it does is dump the entity to the console. It tries to use the VarDumper component from Symfony to provide maximum readability, so you might want to install it (composer require symfony/var-dumper); otherwise Scrawler will fall back to a plain var_dump() call.

This result writer does not define any additional configuration options.

JsonFileResultWriter

This block implementation stores each of your entities in a single JSON file. Every entity property is used as a property name in the resulting file.

This is a file result writer. You might want to read the general section about them.

Aside from the options common to all file result writers, the JSON result writer accepts an options configuration key, which is passed as the options argument to the json_encode() PHP function. The default value is JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE.
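
For example, a configuration that drops pretty printing in favor of compact output could be sketched as follows. The directory value and the choice of IncrementalFilenameProvider are assumptions carried over from the other examples on this page; JSON_UNESCAPED_SLASHES is a standard json_encode() flag.

```php
->addResultWriter(PostEntity::class, new JsonFileResultWriter([
    'directory' => 'posts/',
    'filename' => new IncrementalFilenameProvider(),
    // Overrides the default JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE.
    'options' => JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES,
]))
```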

TemplateFileResultWriter

The most generic of all file result writers. It makes no particular assumptions; instead, it only takes a template string in which you can use variables referencing fields defined by your entity getters.

Aside from specifying the template string under the template configuration key, you can also set the extension. Omitting that key or setting it explicitly to null results in files being created with no extension.

->addResultWriter(PostEntity::class, new TemplateFileResultWriter([
    'directory' => 'posts/',
    'extension' => 'txt',
    'filename' => new IncrementalFilenameProvider(),
    'template' => 'Published at {{date}}'
]))

In real-life examples you will probably want to load the template from a file and keep it outside of your main configuration. Currently variable names are limited to [a-zA-Z_], so it's better not to be too creative with getter names for this use case. You can insert any number of tabs or spaces between a variable name and the curly braces delimiting it. Finally, a variable with no counterpart among the defined getters is printed verbatim in the result file, like so: Published at {{date}}.
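
The substitution rules described above can be sketched in plain PHP. The renderTemplate() function below is a hypothetical helper written for illustration and is not Scrawler's actual implementation. It takes an array of field values instead of calling entity getters, and it assumes that an unknown placeholder is left exactly as it was typed.

```php
<?php

/**
 * Hypothetical helper, not part of Scrawler: replaces {{ name }} placeholders
 * with values from $fields and leaves unknown placeholders untouched.
 */
function renderTemplate(string $template, array $fields): string
{
    // Names are limited to [a-zA-Z_]; tabs or spaces may pad the braces.
    return preg_replace_callback(
        '/\{\{[ \t]*([a-zA-Z_]+)[ \t]*\}\}/',
        static function (array $match) use ($fields): string {
            // $match[0] is the whole placeholder, $match[1] the variable name.
            return array_key_exists($match[1], $fields)
                ? (string) $fields[$match[1]]
                : $match[0];
        },
        $template
    );
}

$fields = ['date' => '2019-01-15', 'title' => 'Hello'];

// prints: Hello, published at 2019-01-15
echo renderTemplate("{{ title }}, published at {{\tdate }}", $fields), PHP_EOL;

// prints: By {{author}}
echo renderTemplate('By {{author}}', $fields), PHP_EOL;
```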

Even though this result writer can theoretically replace many others (you could even type a JSON template by hand), you should use specialized writers whenever possible, including writing your own implementations.