# In-Depth Design

## Project Structure
The structure of the project is divided into three principal components: `DSL`, `core`, and `utils`:

- The `core` component contains the main entities and the logic to manage them;
- The `DSL` component manages the way the user configures the system using our custom internal domain-specific language;
- The `utils` component contains utility classes and functions.
## Components
### Core

The core package of the Scooby library contains the main entities involved in the scraping process: the `Crawler`, `Scraper`, `Coordinator`, and `Exporter`. `Scooby` is the entity responsible for system start-up and management.
#### Crawler

A Crawler is an actor responsible for searching and exploring links on web pages. It interacts with a Coordinator to validate the URLs it finds, creates Scrapers to extract data, and spawns new Crawlers to explore new URLs. A Crawler downloads the content of a web page using the `HTTP` utility and parses it with the `Document` component of the `utils` package.
##### Crawler Messages

| Message | Description |
| --- | --- |
|  | Start crawling a specific URL. |
|  | Receive a response from the Coordinator. This message should be sent only by the Coordinator of the system. |
|  | Signal this Crawler that one of its sub-crawlers has terminated its computation. |
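As a concrete illustration of this protocol, the messages in the table could be modeled as a sealed family of commands. This is only a sketch: the type and field names below (`CrawlerCommand`, `Crawl`, `CrawlerCoordinatorResponse`, `ChildTerminated`) are assumptions, not the library's actual identifiers.

```scala
import java.net.URL

// Hypothetical sketch of the Crawler protocol as a sealed ADT.
sealed trait CrawlerCommand

object CrawlerCommand {
  // Start crawling a specific URL.
  final case class Crawl(url: URL) extends CrawlerCommand

  // Response from the Coordinator with the URLs that passed validation;
  // only the Coordinator is expected to send this message.
  final case class CrawlerCoordinatorResponse(allowedUrls: Iterable[URL]) extends CrawlerCommand

  // A sub-crawler signals that it has terminated its computation.
  case object ChildTerminated extends CrawlerCommand
}
```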
#### Scraper

A Scraper is an actor responsible for extracting data from a web page. It receives a document from a Crawler, extracts the relevant information, and sends the results to one or more Exporters.
##### Scraper Messages

| Message | Description |
| --- | --- |
|  | Start to scrape a specific document. |
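To picture what "extracting the relevant information" can amount to, a scraping step can be viewed as a pure function from page content to a collection of results. The `ScrapingPolicy` alias and the regex-based example below are illustrative assumptions, not Scooby's actual API.

```scala
object ScrapingSketch {
  // A scraping step as a pure function from page content to results.
  type ScrapingPolicy[T] = String => Iterable[T]

  // Example policy: extract e-mail-like strings with a regular expression.
  val emails: ScrapingPolicy[String] =
    html => "[\\w.+-]+@[\\w-]+\\.[\\w.]+".r.findAllIn(html).toSeq

  // A Scraper would apply such a policy to the document received from a
  // Crawler and forward the resulting collection to the Exporters.
}
```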
#### Coordinator

The Coordinator is an actor that validates the URLs found by Crawlers. Checks are usually based on a set of rules defined by the user, which together form a policy dictating which URLs are valid and which are not. The Coordinator also checks whether a URL has already been visited by a Crawler and whether it is allowed by the website's robots.txt file.
##### Coordinator Messages

| Message | Description |
| --- | --- |
|  | Parse the robots.txt file of the URL and obtain its rules. |
|  | Check the pages fetched by a Crawler. |
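The rule-based validation described above can be pictured as the composition of simple predicates over candidate URLs. The policy type and rules below are a minimal sketch under that assumption; the Coordinator's real rule representation may differ.

```scala
import java.net.URL

object PolicySketch {
  // A validation rule as a predicate over candidate URLs (illustrative).
  type UrlPolicy = URL => Boolean

  // Only follow links on a given host.
  def sameHostAs(host: String): UrlPolicy =
    url => url.getHost == host

  // Skip URLs that have already been visited.
  def notVisited(visited: Set[String]): UrlPolicy =
    url => !visited.contains(url.toString)

  // A URL is accepted only if every rule accepts it.
  def allOf(rules: UrlPolicy*): UrlPolicy =
    url => rules.forall(rule => rule(url))
}
```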
#### Exporter

The Exporter is an actor responsible for exporting the scraped data. It receives results from the Scrapers and exports them in a specific format. Scooby supports two types of exporters: `StreamExporter` and `BatchExporter`. The former exports data as soon as it is scraped, while the latter aggregates the results and exports them all at once. For both kinds of exporter, it is possible to define a behaviour that specifies the format of the output and how to export it.
##### Exporter Messages

| Message | Description |
| --- | --- |
|  | Export the result of a scraping operation. |
|  | Signal the end of the export process. |
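The stream/batch distinction can be sketched as two implementations of the same exporting behaviour, one writing each result immediately and one buffering until the end. The trait and class names below are assumptions used only to illustrate the idea.

```scala
// Illustrative sketch of the two exporting strategies.
trait ExportingBehaviour[T] {
  def onResult(result: T): Unit // invoked for every scraped result
  def onEnd(): Unit             // invoked once scraping is finished
}

// Stream-style: export each result as soon as it arrives.
final class StreamToConsole[T](format: T => String) extends ExportingBehaviour[T] {
  def onResult(result: T): Unit = println(format(result))
  def onEnd(): Unit = ()
}

// Batch-style: aggregate the results and export them all at once at the end.
final class BatchToConsole[T](format: Iterable[T] => String) extends ExportingBehaviour[T] {
  private val buffer = scala.collection.mutable.ListBuffer.empty[T]
  def onResult(result: T): Unit = buffer += result
  def onEnd(): Unit = println(format(buffer.toList))
}
```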
#### Scooby

Scooby is the main entity of the system, responsible for starting the system and managing the entities. It receives a `Configuration` object that describes the desired settings and starts the system accordingly.
##### Scooby Messages

| Message | Description |
| --- | --- |
|  | Start the application. |
|  | Signal that the operations for checking the robots.txt file are finished. |
|  | Signal the end of the exporting operations. |
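As a rough sketch of this start-up role, a guardian behaviour could bootstrap the whole system from the received `Configuration`. The example assumes an Akka Typed runtime and invents the object and field names; it only illustrates the idea, not Scooby's actual code.

```scala
import akka.actor.typed.{ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors

object ScoobySketch {
  // Illustrative Configuration; the real one carries the settings produced
  // by the DSL described in the next section.
  final case class Configuration(seedUrl: String)

  // Guardian behaviour that would bootstrap the system from a Configuration.
  def apply(config: Configuration): Behavior[Nothing] =
    Behaviors.setup[Nothing] { context =>
      // The real Scooby would spawn the Coordinator, the root Crawler and the
      // Exporters here, wiring them up according to the configuration.
      context.log.info("Starting crawl from {}", config.seedUrl)
      Behaviors.empty
    }

  def main(args: Array[String]): Unit =
    ActorSystem[Nothing](ScoobySketch(Configuration("https://www.example.com")), "scooby")
}
```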
### DSL

The DSL component is responsible for managing the way the user configures the system using our custom internal domain-specific language. Every main entity in the system has a corresponding set of operations that can be used to produce the desired configuration, described by a `Configuration` object that is then used by the `Scooby` entity to start the system.
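To illustrate the internal-DSL pattern in general terms (the keywords below are not Scooby's actual DSL, they only show the technique), configuration can be expressed as keyword-like functions that fill a builder and finally yield an immutable `Configuration`.

```scala
object DslSketch {
  // Illustrative Configuration and builder; fields and keywords are assumptions.
  final case class Configuration(url: String = "", maxDepth: Int = 0)

  final class ConfigurationBuilder {
    private var config = Configuration()
    def url(value: String): Unit = config = config.copy(url = value)
    def maxDepth(value: Int): Unit = config = config.copy(maxDepth = value)
    def build: Configuration = config
  }

  // The entry point collects the user's settings and produces the Configuration.
  def scooby(init: ConfigurationBuilder => Unit): Configuration = {
    val builder = new ConfigurationBuilder
    init(builder)
    builder.build
  }

  // Usage: the block reads like a configuration, but it is ordinary Scala code.
  val config: Configuration = scooby { c =>
    c.url("https://www.example.com")
    c.maxDepth(2)
  }
}
```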
### Utils

The `utils` component contains utility classes and functions used by the core entities. `Document` represents an HTML document fetched from a URL, while the `HTTP` trait is used to download the content of a web page.
#### Document

Document makes it easy to retrieve and work with the content of a web page. Crawlers and Scrapers use it to parse the content of a page and extract the relevant information. The operations allowed on a document are defined by the Explorer mixins it uses. In our application we have two kinds of document:

- `CrawlDocument`: for crawling operations
- `ScrapeDocument`: for scraping operations

The former should only expose operations for fetching links, so the `LinkExplorer` and `EnhancedLinkExplorer` mixins are used; for the latter, we used the `SelectorExplorer`, `CommonHTMLExplorer`, and `RegExpExplorer` mixins, enabling the document to extract data from the page.
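The mixin-based design can be sketched as traits that each add a family of operations to a bare document. The member names and the use of jsoup as parser below are assumptions made for the example, not Scooby's actual implementation.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Illustrative sketch of composing document capabilities through mixins.
trait Document {
  def html: String
  def url: String
}

// Adds link-extraction operations, as a crawling document needs.
trait LinkExplorer extends Document {
  def frontier: Seq[String] =
    Jsoup.parse(html, url).select("a[href]").asScala.map(_.attr("abs:href")).toSeq
}

// Adds CSS-selector queries, as a scraping document needs.
trait SelectorExplorer extends Document {
  def select(selector: String): Seq[String] =
    Jsoup.parse(html, url).select(selector).asScala.map(_.text).toSeq
}

// The two document kinds are then just different mixin combinations.
final class CrawlDocument(val html: String, val url: String)
  extends Document with LinkExplorer
final class ScrapeDocument(val html: String, val url: String)
  extends Document with SelectorExplorer
```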
#### HTTP

HTTP is a utility component that wraps an HTTP client library behind a simple, easy-to-use API for downloading and parsing the content of a web page.
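A minimal sketch of the wrapping idea follows; the trait and method names, and the standard-library backend, are assumptions rather than Scooby's actual API.

```scala
import scala.io.Source
import scala.util.Try

// Illustrative only: a small trait hiding the concrete HTTP client, so the
// rest of the system depends on this API rather than on a specific library.
trait HttpClient {
  // Download the body of the page at the given URL, if the request succeeds.
  def get(url: String): Try[String]
}

// One possible backing implementation based on the Scala standard library.
object SimpleHttpClient extends HttpClient {
  def get(url: String): Try[String] =
    Try {
      val source = Source.fromURL(url)
      try source.mkString
      finally source.close()
    }
}
```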