# In-Depth Design

## Project Structure
The structure of the project is divided into three principal components: `DSL`, `core`, and `utils`:

- The `core` component contains the main entities and the logic to manage them;
- The `DSL` component manages the way the user configures the system using our custom internal domain-specific language;
- The `utils` component contains utility classes and functions.
## Components
### Core

The core package of the Scooby library contains the main entities involved in the scraping process: the `Crawler`, `Scraper`, `Coordinator`, and `Exporter`. `Scooby` is the entity responsible for system start-up and management.
#### Crawler

A Crawler is an actor responsible for searching and exploring links on web pages. It interacts with a Coordinator to validate the URLs it finds, creates Scrapers to extract data, and spawns new Crawlers to explore new URLs. A Crawler downloads the content of a web page using the `HTTP` utility and parses it with the `Document` component of the `utils` package.
##### Crawler Messages

| Message | Description |
| --- | --- |
|  | Start crawling a specific URL. |
|  | Receive a response from the Coordinator. This message should be sent only by the Coordinator of the system. |
|  | Signal this Crawler that one of its sub-crawlers has terminated its computation. |
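As a concrete illustration of this protocol, the messages in the table could be modeled as a sealed family of commands. This is only a sketch: the type and field names below (`CrawlerCommand`, `Crawl`, `CrawlerCoordinatorResponse`, `ChildTerminated`) are assumptions, not the library's actual identifiers.

```scala
import java.net.URL

// Hypothetical sketch of the Crawler protocol as a sealed ADT.
sealed trait CrawlerCommand

object CrawlerCommand {
  // Start crawling a specific URL.
  final case class Crawl(url: URL) extends CrawlerCommand

  // Response from the Coordinator with the URLs that passed validation;
  // only the Coordinator is expected to send this message.
  final case class CrawlerCoordinatorResponse(allowedUrls: Iterable[URL]) extends CrawlerCommand

  // A sub-crawler signals that it has terminated its computation.
  case object ChildTerminated extends CrawlerCommand
}
```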
#### Scraper

A Scraper is an actor responsible for extracting data from a web page. It receives a document from a Crawler, extracts the relevant information, and sends the results to one or more Exporters.
##### Scraper Messages

| Message | Description |
| --- | --- |
|  | Start to scrape a specific document. |
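To picture what "extracting the relevant information" can amount to, a scraping step can be viewed as a pure function from page content to a collection of results. The `ScrapingPolicy` alias and the regex-based example below are illustrative assumptions, not Scooby's actual API.

```scala
object ScrapingSketch {
  // A scraping step as a pure function from page content to results.
  type ScrapingPolicy[T] = String => Iterable[T]

  // Example policy: extract e-mail-like strings with a regular expression.
  val emails: ScrapingPolicy[String] =
    html => "[\\w.+-]+@[\\w-]+\\.[\\w.]+".r.findAllIn(html).toSeq

  // A Scraper would apply such a policy to the document received from a
  // Crawler and forward the resulting collection to the Exporters.
}
```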
#### Coordinator

The Coordinator is an actor that validates the URLs found by Crawlers. Checks are usually based on a set of rules defined by the user, which together form a policy dictating which URLs are valid and which are not. The Coordinator also checks whether a URL has already been visited by a Crawler and whether it is allowed by the website's robots.txt file.
##### Coordinator Messages

| Message | Description |
| --- | --- |
|  | Parse the robots.txt file of the URL and obtain its rules. |
|  | Check the pages fetched by a Crawler. |
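The rule-based validation described above can be pictured as the composition of simple predicates over candidate URLs. The policy type and rules below are a minimal sketch under that assumption; the Coordinator's real rule representation may differ.

```scala
import java.net.URL

object PolicySketch {
  // A validation rule as a predicate over candidate URLs (illustrative).
  type UrlPolicy = URL => Boolean

  // Only follow links on a given host.
  def sameHostAs(host: String): UrlPolicy =
    url => url.getHost == host

  // Skip URLs that have already been visited.
  def notVisited(visited: Set[String]): UrlPolicy =
    url => !visited.contains(url.toString)

  // A URL is accepted only if every rule accepts it.
  def allOf(rules: UrlPolicy*): UrlPolicy =
    url => rules.forall(rule => rule(url))
}
```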
#### Exporter

The Exporter is an actor responsible for exporting the scraped data. It receives results from the Scrapers and exports them in a specific format. Scooby supports two types of exporters: `StreamExporter` and `BatchExporter`. The former exports data as soon as it is scraped, while the latter aggregates the results and exports them all at once. For both kinds of exporter, it is possible to define a behaviour that specifies the format of the output and how to export it.
##### Exporter Messages

| Message | Description |
| --- | --- |
|  | Export the result of a scraping operation. |
|  | Signal the end of the export process. |
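The stream/batch distinction can be sketched as two implementations of the same exporting behaviour, one writing each result immediately and one buffering until the end. The trait and class names below are assumptions used only to illustrate the idea.

```scala
// Illustrative sketch of the two exporting strategies.
trait ExportingBehaviour[T] {
  def onResult(result: T): Unit // invoked for every scraped result
  def onEnd(): Unit             // invoked once scraping is finished
}

// Stream-style: export each result as soon as it arrives.
final class StreamToConsole[T](format: T => String) extends ExportingBehaviour[T] {
  def onResult(result: T): Unit = println(format(result))
  def onEnd(): Unit = ()
}

// Batch-style: aggregate the results and export them all at once at the end.
final class BatchToConsole[T](format: Iterable[T] => String) extends ExportingBehaviour[T] {
  private val buffer = scala.collection.mutable.ListBuffer.empty[T]
  def onResult(result: T): Unit = buffer += result
  def onEnd(): Unit = println(format(buffer.toList))
}
```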
#### Scooby

Scooby is the main entity of the system, responsible for starting the system and managing the entities. It receives a `Configuration` object that describes the desired settings and starts the system accordingly.
##### Scooby Messages

| Message | Description |
| --- | --- |
|  | Start the application. |
|  | Signal that the operations for checking the robots.txt file are finished. |
|  | Signal the end of the exporting operations. |
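As a rough sketch of this start-up role, a guardian behaviour could bootstrap the whole system from the received `Configuration`. The example assumes an Akka Typed runtime and invents the object and field names; it only illustrates the idea, not Scooby's actual code.

```scala
import akka.actor.typed.{ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors

object ScoobySketch {
  // Illustrative Configuration; the real one carries the settings produced
  // by the DSL described in the next section.
  final case class Configuration(seedUrl: String)

  // Guardian behaviour that would bootstrap the system from a Configuration.
  def apply(config: Configuration): Behavior[Nothing] =
    Behaviors.setup[Nothing] { context =>
      // The real Scooby would spawn the Coordinator, the root Crawler and the
      // Exporters here, wiring them up according to the configuration.
      context.log.info("Starting crawl from {}", config.seedUrl)
      Behaviors.empty
    }

  def main(args: Array[String]): Unit =
    ActorSystem[Nothing](ScoobySketch(Configuration("https://www.example.com")), "scooby")
}
```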
### DSL

The DSL component is responsible for managing the way the user configures the system using our custom internal domain-specific language. Every main entity in the system has a corresponding set of operations that can be used to produce the desired configuration, described by a `Configuration` object that is then used by the `Scooby` entity to start the system.
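To illustrate the internal-DSL pattern in general terms (the keywords below are not Scooby's actual DSL, they only show the technique), configuration can be expressed as keyword-like functions that fill a builder and finally yield an immutable `Configuration`.

```scala
object DslSketch {
  // Illustrative Configuration and builder; fields and keywords are assumptions.
  final case class Configuration(url: String = "", maxDepth: Int = 0)

  final class ConfigurationBuilder {
    private var config = Configuration()
    def url(value: String): Unit = config = config.copy(url = value)
    def maxDepth(value: Int): Unit = config = config.copy(maxDepth = value)
    def build: Configuration = config
  }

  // The entry point collects the user's settings and produces the Configuration.
  def scooby(init: ConfigurationBuilder => Unit): Configuration = {
    val builder = new ConfigurationBuilder
    init(builder)
    builder.build
  }

  // Usage: the block reads like a configuration, but it is ordinary Scala code.
  val config: Configuration = scooby { c =>
    c.url("https://www.example.com")
    c.maxDepth(2)
  }
}
```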
### Utils

The `utils` component contains utility classes and functions used by the core entities. `Document` represents an HTML document fetched from a URL, while the `HTTP` trait is used to download the content of a web page.
#### Document

Document makes it easy to retrieve and work with the content of a web page. Crawlers and Scrapers use it to parse the content of a page and extract the relevant information. The operations allowed on a document are defined by the Explorer mixins it uses. In our application we have two kinds of document:

- `CrawlDocument`: for crawling operations
- `ScrapeDocument`: for scraping operations

The former should only expose operations for fetching links, so the `LinkExplorer` and `EnhancedLinkExplorer` mixins are used; for the latter, we used the `SelectorExplorer`, `CommonHTMLExplorer`, and `RegExpExplorer` mixins, enabling the document to extract data from the page.
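The mixin-based design can be sketched as traits that each add a family of operations to a bare document. The member names and the use of jsoup as parser below are assumptions made for the example, not Scooby's actual implementation.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Illustrative sketch of composing document capabilities through mixins.
trait Document {
  def html: String
  def url: String
}

// Adds link-extraction operations, as a crawling document needs.
trait LinkExplorer extends Document {
  def frontier: Seq[String] =
    Jsoup.parse(html, url).select("a[href]").asScala.map(_.attr("abs:href")).toSeq
}

// Adds CSS-selector queries, as a scraping document needs.
trait SelectorExplorer extends Document {
  def select(selector: String): Seq[String] =
    Jsoup.parse(html, url).select(selector).asScala.map(_.text).toSeq
}

// The two document kinds are then just different mixin combinations.
final class CrawlDocument(val html: String, val url: String)
  extends Document with LinkExplorer
final class ScrapeDocument(val html: String, val url: String)
  extends Document with SelectorExplorer
```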
#### HTTP

HTTP is a utility component that wraps an HTTP client library behind a simple, easy-to-use API for downloading and parsing the content of a web page.
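A minimal sketch of the wrapping idea follows; the trait and method names, and the standard-library backend, are assumptions rather than Scooby's actual API.

```scala
import scala.io.Source
import scala.util.Try

// Illustrative only: a small trait hiding the concrete HTTP client, so the
// rest of the system depends on this API rather than on a specific library.
trait HttpClient {
  // Download the body of the page at the given URL, if the request succeeds.
  def get(url: String): Try[String]
}

// One possible backing implementation based on the Scala standard library.
object SimpleHttpClient extends HttpClient {
  def get(url: String): Try[String] =
    Try {
      val source = Source.fromURL(url)
      try source.mkString
      finally source.close()
    }
}
```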