Valerio Di Zio
The parts I focused on most during development were:
A utility for handling robots.txt files on websites;
The configuration class;
DSL rules for using the crawler and for defining headers;
A MockServer for testing;
Test suites for the components above.
Implementation details for the most relevant parts are described in the following sections.
Coordinator
The coordinator is the actor with which the various crawlers interact to determine whether they can visit a URL. This decision is based on URLs that have been previously visited and the restrictions specified in the robots.txt file, which lists the URLs that crawlers are not authorized to access.
The coordinator has been designed with a recursive behavior. It has a single behavior, idle(): once a request from a crawler has been handled, the behavior is re-entered with the updated list of links already visited.
The coordinator handles the following messages:
SetupRobots, called at application startup so that the coordinator has the up-to-date list of "Disallow" paths from robots.txt;
CheckPages, used when a crawler needs to know whether it can visit a URL, and to update the list of URLs already visited during execution.
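The decision logic behind CheckPages can be sketched as a pure state transition, leaving the actor machinery aside. This is a minimal sketch under stated assumptions, not the project's implementation: CoordinatorState, checkPages and the prefix-matching rule are hypothetical names and choices made for illustration.

```scala
import java.net.URI

// Hypothetical state passed from one idle(...) invocation to the next.
final case class CoordinatorState(visited: Set[String], disallowed: Set[String])

// Returns the URLs the crawler may visit, plus the state for the next invocation.
// Assumption: a URL is rejected if it was already visited or if its path
// starts with one of the disallowed paths.
def checkPages(state: CoordinatorState, urls: List[String]): (List[String], CoordinatorState) = {
  val allowed = urls.distinct.filter { url =>
    val path = Option(URI.create(url).getPath).getOrElse("")
    !state.visited.contains(url) && !state.disallowed.exists(d => path.startsWith(d))
  }
  (allowed, state.copy(visited = state.visited ++ allowed))
}
```

In an actor, the returned state would be the argument of the next idle(...) call, which is what makes the behavior recursive.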
Robots.txt
A custom parser translates robots.txt files into a set of non-visitable paths. Its output is the set of disallowed paths that crawlers must not explore.
The coordinator retrieves this set and prevents crawlers from visiting the paths marked ‘Disallow’ in robots.txt.
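As an illustration of what such a parser might do, the sketch below extracts the Disallow paths from the text of a robots.txt file. It is a simplified example: parseDisallowed is a hypothetical name, and the sketch ignores User-agent groups, Allow rules and wildcards, which the real parser may handle differently.

```scala
// Hypothetical parser: collects the paths of all Disallow directives.
def parseDisallowed(robotsTxt: String): Set[String] =
  robotsTxt.linesIterator
    .map(_.takeWhile(_ != '#').trim)                // strip trailing comments
    .filter(_.toLowerCase.startsWith("disallow:"))  // directive names are case-insensitive
    .map(_.drop("disallow:".length).trim)
    .filter(_.nonEmpty)                             // an empty Disallow restricts nothing
    .toSet
```

Returning a Set keeps the result free of duplicates, which fits the "set of disallowed paths" described above.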
Crawler
My contribution to the crawler was its interaction with the coordinator and, based on the response, the spawning of new crawlers. The visitChildren() method launches other crawlers based on the coordinator's response.
DSL
For the DSL, my contribution covers the crawl keyword and support for defining headers directly in the configuration.
Crawl keyword
The DSL operators defined in the Crawl object are designed to allow smooth and readable crawler configuration through natural language-like syntax. These operators are used to specify where to start browsing and what crawling policies to adopt.
CrawlContext is a case class that represents the execution context for the crawl configuration. It contains two variables: url and policy.
url: represents the URL from which the crawler starts browsing.
policy: an instance of ExplorationPolicy that defines the crawler's exploration strategy.
The crawl keyword is an inline method that configures the crawl context. It uses catchRecursiveCtx to prevent recursive calls and establishes a CrawlContext. It is used in conjunction with a globalScope that represents the global configuration of the application.
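The general shape of such a context-based keyword can be sketched with Scala 3 context functions. This is only an illustration under assumptions: the ExplorationPolicy alias is a stand-in for the project's type, the url and policy helper functions are hypothetical, and the sketch omits catchRecursiveCtx, globalScope and the inline modifier.

```scala
// Stand-in for the project's ExplorationPolicy (assumption: the real type is not shown here).
type ExplorationPolicy = String => Seq[String]

// Execution context holding the start URL and the exploration policy.
case class CrawlContext(var url: String, var policy: ExplorationPolicy)

// Hypothetical keywords that write into the context currently in scope.
def url(value: String)(using ctx: CrawlContext): Unit = ctx.url = value
def policy(value: ExplorationPolicy)(using ctx: CrawlContext): Unit = ctx.policy = value

// Runs the block against a fresh context and returns the configured result.
def crawl(block: CrawlContext ?=> Unit): CrawlContext = {
  val ctx = CrawlContext("", _ => Seq.empty)
  block(using ctx)
  ctx
}
```

With these definitions, a block such as crawl { url("https://example.com") } yields a CrawlContext whose url field is set, without the user ever naming the context.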
Headers keyword
The DSL provides a way to specify configurations such as network settings, headers for HTTP requests, and other options in a structured and readable manner. My role was to provide support for specifying headers to be used in the HTTP request.
The context class for the headers block holds a mutable map that represents the headers. The map is initially empty when the context is created and is updated as headers are defined within the DSL block.
A piece of syntactic sugar allows the user to write "HeaderName" to "HeaderValue" within a headers block.
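One way to obtain this syntax in Scala 3 is an extension method on String that requires a context in scope. The sketch below is an assumption about how it could be built, not the project's code: HeadersContext and the standalone headers function are hypothetical names, and the real DSL is integrated with the wider configuration.

```scala
import scala.collection.mutable

// Hypothetical context: starts empty and is filled while the block runs.
final class HeadersContext {
  val headers: mutable.Map[String, String] = mutable.Map.empty
}

// Enables `"Name" to "Value"`, but only where a HeadersContext is available.
extension (name: String)
  infix def to(value: String)(using ctx: HeadersContext): Unit =
    ctx.headers(name) = value

// Runs the block with a fresh context and returns an immutable snapshot.
def headers(block: HeadersContext ?=> Unit): Map[String, String] = {
  val ctx = new HeadersContext
  block(using ctx)
  ctx.headers.toMap
}
```

Because the extension needs a HeadersContext, writing "HeaderName" to "HeaderValue" outside a headers block does not compile, which keeps the keyword scoped to where it makes sense.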
Testing
MockServer
During the development of our application, it became apparent that reproducibility was necessary, and using real websites to test the application’s functionality was impractical for two main reasons:
The structure of the HTML could change.
Necessary tags and information might not consistently be available on a website to thoroughly test certain functionalities.
To address this issue, a MockServer was created. This server is specifically designed to provide HTML resources solely for testing purposes and is shut down once testing is complete.
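The idea can be sketched with the HTTP server bundled with the JDK. This is a minimal example under assumptions, not the project's MockServer: withMockServer is a hypothetical helper that serves fixed HTML pages on an ephemeral local port and stops the server when the test body returns.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets

// Hypothetical helper: serves `pages` (path -> HTML) while `test` runs, then shuts down.
def withMockServer[A](pages: Map[String, String])(test: String => A): A = {
  // Port 0 asks the OS for any free port, so parallel test runs do not clash.
  val server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0)
  server.createContext("/", (exchange: HttpExchange) => {
    val (status, body) = pages.get(exchange.getRequestURI.getPath) match {
      case Some(html) => (200, html)
      case None       => (404, "Not found")
    }
    val bytes = body.getBytes(StandardCharsets.UTF_8)
    exchange.getResponseHeaders.add("Content-Type", "text/html; charset=utf-8")
    exchange.sendResponseHeaders(status, bytes.length.toLong)
    val out = exchange.getResponseBody
    try out.write(bytes) finally out.close()
  })
  server.start()
  try test(s"http://127.0.0.1:${server.getAddress.getPort}")
  finally server.stop(0) // always release the port, even if the test fails
}
```

Since the pages are fixed strings under the test's control, the HTML structure cannot change between runs, which addresses both problems listed above.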
Cucumber and Unit Testing
Cucumber tests and unit tests were implemented to ensure the comprehensive verification of both the overall system behavior and the functionality of individual components.
Specifically, Cucumber tests were used to validate the system’s behavior from an end-to-end perspective, ensuring that it meets the specified requirements and user expectations. On the other hand, unit tests were employed to rigorously check each component’s functionality in isolation, utilizing ScalaTest as the testing framework. This dual approach helps in identifying and addressing issues at different levels, thus contributing to the overall reliability and robustness of the application.