Giovanni Antonioni
The main areas that I've contributed to on the implementation side are the Crawler, the Exporter and the rule system of the DSL.
As side work, I've also set up the CI/CD pipeline and the documentation system for the project.
Crawler
For the Crawler component I've followed Akka's FSM design principle, using the Behaviors DSL. It is possible to identify the following two states:
The `idle` state, where the actor receives the URL to crawl, starts the process of checking the document frontier, and spawns a Scraper child actor and the sub-crawlers.
The `waitForChildren` state, where the actor waits for the spawned children to terminate their computations.
Each of these states is managed by a specific function that handles the state transition and the message processing, as sketched below.
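A minimal sketch of these two behaviors with Akka Typed is shown below; the message names (`Crawl`, `ChildTerminated`) and the child-spawning details are assumptions made for illustration, not the project's actual protocol.

```scala
import java.net.URL

import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

sealed trait CrawlerCommand
final case class Crawl(url: URL) extends CrawlerCommand
case object ChildTerminated extends CrawlerCommand

object Crawler:

  // idle: wait for the URL to crawl, check the frontier and start the children.
  def idle(): Behavior[CrawlerCommand] =
    Behaviors.receive { (context, message) =>
      message match
        case Crawl(url) =>
          context.log.info(s"Crawling $url")
          // here the document frontier would be checked and the Scraper child
          // plus the sub-crawlers spawned (omitted in this sketch)
          waitForChildren(children = 1)
        case _ =>
          Behaviors.same
    }

  // waitForChildren: wait until every spawned child has terminated its computation.
  def waitForChildren(children: Int): Behavior[CrawlerCommand] =
    Behaviors.receiveMessage {
      case ChildTerminated if children > 1 => waitForChildren(children - 1)
      case ChildTerminated                 => Behaviors.stopped
      case _                               => Behaviors.same
    }
```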
Crawler's Exploration Policy
An Exploration Policy describes the way crawlers fetch links from a page. It is represented by a function that receives an HTML document (CrawlDocument) as input and returns an iterable of URLs.
Defining an exploration policy in terms of a function makes it easy to change the behavior of the crawler and to extend it with new functionality.
As an example, we can describe an exploration policy that only fetches same-domain URLs:
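A minimal sketch, assuming a simplified `CrawlDocument` that exposes the page URL and its links (the project's actual type may expose them differently):

```scala
import java.net.URL

// Simplified stand-in for the project's CrawlDocument.
final case class CrawlDocument(url: URL, links: Iterable[URL])

// An exploration policy maps the current document to the URLs to visit next.
type ExplorationPolicy = CrawlDocument => Iterable[URL]

// Keep only the links whose host matches the host of the current page.
val sameDomainLinks: ExplorationPolicy = document =>
  document.links.filter(_.getHost == document.url.getHost)
```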
It's important to note that all crawlers inside the system share the same exploration policy, which should be configured at application startup.
Exporter
Similarly to the Crawler, the Exporter is also designed as an actor that waits to receive a Result message from a Scraper containing partial data from the scraping process. Depending on the type of Exporter, the result is handled differently: with the `StreamExporter` it is processed immediately, while with the `BatchExporter` it is accumulated until the end of the scraping process.
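The difference can be sketched with two Akka Typed behaviors; the `Result`, `Export` and `ScrapingEnded` names below are illustrative and may not match the project's actual message protocol.

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

final case class Result(data: String)

sealed trait ExporterCommand
final case class Export(result: Result) extends ExporterCommand
case object ScrapingEnded extends ExporterCommand

// StreamExporter: every partial result is processed as soon as it arrives.
def streamExporter(write: Result => Unit): Behavior[ExporterCommand] =
  Behaviors.receiveMessage {
    case Export(result) =>
      write(result)
      Behaviors.same
    case ScrapingEnded =>
      Behaviors.stopped
  }

// BatchExporter: partial results are accumulated and exported only at the end.
def batchExporter(writeAll: Seq[Result] => Unit,
                  accumulated: Seq[Result] = Seq.empty): Behavior[ExporterCommand] =
  Behaviors.receiveMessage {
    case Export(result) =>
      batchExporter(writeAll, accumulated :+ result)
    case ScrapingEnded =>
      writeAll(accumulated)
      Behaviors.stopped
  }
```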
We can control different aspects of an exporter, such as the aggregation behavior, the exporting behavior and the output format. These are all defined as custom Scala types that are passed to the constructor of the Exporter during its creation:
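For illustration, these aspects could be modelled as plain function types; the names and signatures below are assumptions, not the project's actual definitions.

```scala
// Hypothetical result type and configurable aspects of an exporter.
type ScrapeResult        = Seq[String]
type AggregationBehavior = (ScrapeResult, ScrapeResult) => ScrapeResult // how partial results are merged
type OutputFormat        = ScrapeResult => String                       // e.g. JSON or CSV rendering
type ExportingBehavior   = String => Unit                               // where the formatted output goes

// Example configuration: concatenate the partial results, render one item per line
// and print the formatted output to stdout.
val aggregate: AggregationBehavior = _ ++ _
val render: OutputFormat           = _.mkString("\n")
val write: ExportingBehavior       = formatted => println(formatted)
```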
Rule system DSL
The part I've implemented on the DSL side is the rule system for the scraper. The rule system is a set of keywords that allows defining which elements of the page should be scraped, based on different conditions.
When a user defines a `scrape` block in a DSL snippet, a ScrapingContext is opened and it's possible to set a series of rules to define the scraping policy.
An example of the DSL:
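A minimal, self-contained sketch built around the keywords described below (`that`, `haveClass`, `haveAttributeValue`, `followRule`); the element model and the exact syntax are assumptions and may differ from the project's actual DSL.

```scala
// Simplified element model and rule keywords, for illustration only.
final case class HTMLElement(tag: String, attributes: Map[String, String], classes: Set[String])

type Predicate = HTMLElement => Boolean

def haveClass(cls: String): Predicate = _.classes.contains(cls)

def haveAttributeValue(attribute: String, value: String): Predicate =
  _.attributes.get(attribute).contains(value)

// Highly simplified: the real keyword also checks the scraping context (see below).
def followRule(rule: => Predicate): Predicate = element => rule(element)

extension (elements: Seq[HTMLElement])
  // `that` behaves as an alias of the standard collection method `filter`.
  infix def that(predicate: Predicate): Seq[HTMLElement] = elements.filter(predicate)

// Hypothetical usage: select "download" links that point to a PDF document.
def selectDownloads(page: Seq[HTMLElement]): Seq[HTMLElement] =
  page that haveClass("download") that followRule {
    haveAttributeValue("type", "application/pdf")
  }
```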
In the example above, the `that` keyword is an alias for the Scala collection method `filter`, while `haveAttributeValue`, `haveClass` and `followRule` are methods that generate a predicate for HTML elements.
Note that `followRule` uses a method (`catchRecursiveCtx[HTMLElement]("rule")`) to check the context in which it's used. This prevents the `followRule` keyword from being used recursively inside another `followRule` block.