Sobak's crawler

UrlListProvider Block

URL list provider is a type of block that is required for Scrawler to work properly, thus at least one must be registered. The URLs provided by the block will be used by Scrawler right after base URL so that it knows where to crawl.

Therefore each URL list provider is provided with context of previous URL and last response. You can define your own implementations to focus exactly on the URLs you need to process.

URL list providers are registered using addLogWriter() method on the main Configuration class.

$configuration
    ->addUrlListProvider(new EmptyUrlListProvider())
;

Each URL list provider object is fed with useful data before getUrls() method is called such as base and current URL and the currently processed response object. In most cases you will want to use this data to obtain list of next URLs for your custom implementation.

When trying to read response's content please remember it is a stream so you need to rewind() it before reading, otherwise you will get empty string as it has been read before.

ArgumentAdvancerUrlListProvider

This URL list provider implementation allows to define the URL template in which you want certain part of that URL to increase every time up to some point.

public function __construct(string $template, int $start = 1, int $step = 1, ?int $stop = null)

As you can see, by default it starts with one and increments counter by one after every processed URL. Incremental part is marked using %u in the URL template. When you omit the $stop parameter URLs will be incremented to the point where HTTP 404 is received from the server. Note that it will cause entry into your logs that you should not worry about.

addUrlListProvider(new ArgumentAdvancerUrlListProvider('http://sobak.pl/page/%u', 2, 1))

You can also use relative URL in the template — in such case your base URL will be prepended to it so the repetition can be avoided:

addUrlListProvider(new ArgumentAdvancerUrlListProvider('/page/%u', 2, 1))

In the example above Scrawler will head to http://sobak.pl/page/2 after processing the base URL, then it will continue increasing the number until server responds with HTTP 404.

ArrayUrlListProvider

This URL provider accepts an array of URLs expressed as strings which will be then processed by Scrawler. Each URL is normalized against the base URL. Empty lines and unsupported URLs, e.g. with protocol other than HTTP(S) will be silently ignored.

Do not include your base URL in the list, otherwise it will be processed twice.

EmptyUrlListProvider

This URL provider returns empty list what ultimately makes Scrawler parse base URL only.

SameDomainUrlListProvider

Looks for href attributes of a tags and returns links with the domain matching domain of current operation's base URL. Gathered URLs are normalized (resolving relative paths, protocols, removing anchors etc) before the domain comparison is made. Any unsupported URLs, e.g. with protocol other than HTTP(S), will be silently ignored.