A Domain-Specific Language (DSL) has been developed as an alternative to a standard API for configuring the application.
In keeping with the project's design principles, the DSL is implemented as a layer on top of the existing functional API, so the rest of the system does not depend on the DSL's design or implementation.
The DSL is organized into modules, each representing a different aspect of the application's configuration. There are four modules: Config, Crawl, Scrape, and Export. Each module is divided into two parts: Context and Ops.
Here is an example within the Config module:
Contexts represent the scopes within the DSL. For instance, in the following snippet:
scooby:
  scrape:
    elements
Both scooby and scrape define Contexts.
Ops, on the other hand, are the "keywords" of the language. In the above snippet, all three words (scooby, scrape, and elements) are considered Ops.
This design allows the restriction of valid Ops to specific Contexts. The actual mechanism to enforce this is an implementation detail.
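As an illustration of how Ops can be tied to Contexts, here is a minimal sketch that assumes the host language is Scala 3 (which the snippets' syntax suggests) and uses context functions. The names `ScoobyContext`, `ScrapeContext` and `scrapePolicies` are hypothetical and chosen for this example only; the real implementation may work differently.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical sketch: these types are illustrative, not the project's actual internals.
// Each Context is modelled as a value that must be in scope for its Ops to compile.
final class ScoobyContext:
  val scrapePolicies = ListBuffer.empty[String]

final class ScrapeContext(val parent: ScoobyContext)

// `scooby` opens the outermost Context by supplying a ScoobyContext to its block.
def scooby(block: ScoobyContext ?=> Unit): ScoobyContext =
  val ctx = ScoobyContext()
  block(using ctx)
  ctx

// `scrape` is an Op that requires a ScoobyContext and opens a ScrapeContext.
def scrape(block: ScrapeContext ?=> Unit)(using ctx: ScoobyContext): Unit =
  block(using ScrapeContext(ctx))

// `elements` is an Op that is only valid inside a ScrapeContext.
def elements(using ctx: ScrapeContext): Unit =
  ctx.parent.scrapePolicies += "elements"

// Usage: the nesting mirrors the snippet above.
val example = scooby:
  scrape:
    elements
```

With this approach, writing `elements` directly under `scooby:` would be rejected by the compiler, because no `ScrapeContext` is available in that scope.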
Language Specification
The primary part of the DSL specification is represented in the following EBNF code snippet:
Note: Since the language is implemented as an internal DSL, there are many valid programs beyond this specification. Thus, the provided specification is only partial.
Here are some examples of valid programs written in this DSL.
Example 1: Full Settings Provided
This snippet demonstrates the use of all available settings, providing a comprehensive example of a Scooby DSL program.
scooby:
  config:
    network:
      NetworkTimeout is 5.seconds
      MaxRequests is 100
      headers:
        "User-Agent" to "Scooby/1.0"
    option:
      MaxDepth is 2
      MaxLinks is 100
  crawl:
    url:
      "https://www.example.com/"
    policy:
      hyperlinks not external
  scrape:
    elements
  exports:
    batch:
      strategy:
        results get tag output:
          toFile("test.json") withFormat json
      aggregate:
        _ ++ _
This program is designed to crawl the URL "https://www.example.com/", recursively visiting all found hyperlinks that do not redirect to external domains.
For each page, all HTML elements are scraped and their HTML tags are exported to a file named test.json in JSON format.
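To make the role of the `aggregate:` block more concrete, the following sketch shows one plausible reading of it: per-page results are combined pairwise with the supplied function before being exported once. The `PageResult` type and the `aggregateBatch` helper are assumptions made for this example and are not part of the documented API.

```scala
// Hypothetical model: each page yields a plain sequence of tag names.
type PageResult = Seq[String]

// Folds all per-page results into a single result using the supplied
// aggregation function, which plays the role of the `aggregate:` block.
def aggregateBatch(pages: Seq[PageResult])(
    aggregate: (PageResult, PageResult) => PageResult
): PageResult =
  pages.foldLeft(Seq.empty[String])(aggregate)

// Using `_ ++ _` concatenates the results of every page, in crawl order.
val batch = aggregateBatch(Seq(Seq("html", "body"), Seq("a", "div")))(_ ++ _)
```

Under these assumptions, `batch` holds the tags of both pages in a single sequence, which is what would then be written to the output file.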
Example 2: Scrape and Export
This snippet includes only the scrape and exports sections, as well as the mandatory crawl section to set the root URL.
scooby:
  crawl:
    url:
      "https://www.example.com/"
  scrape:
    elements that (haveTag("a") and haveClass("gorgeous"))
  exports:
    streaming:
      results output:
        toConsole withFormat text
This program is designed to crawl the URL "https://www.example.com/", recursively visiting all found hyperlinks (the default behavior).
For each page crawled, the scraping will focus only on HTML elements with the tag "a" and the class attribute "gorgeous". Each of these elements' outer HTML will be exported in text format and printed to the console.
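The way predicates such as `haveTag("a") and haveClass("gorgeous")` compose can be sketched with ordinary Scala 3 functions and an extension method. The `Element` case class and the `Rule` alias below are hypothetical stand-ins; the function names are reused from the snippet purely for readability, and the project's real definitions may differ.

```scala
// Hypothetical element model; the real HTML element type is not shown in this document.
final case class Element(tag: String, classes: Set[String], outerHtml: String)

// A rule is simply a predicate over elements.
type Rule = Element => Boolean

def haveTag(tag: String): Rule = _.tag == tag
def haveClass(cls: String): Rule = _.classes.contains(cls)

// `and` combines two rules into one that holds only when both hold.
extension (left: Rule)
  infix def and(right: Rule): Rule = el => left(el) && right(el)

val rule: Rule = haveTag("a") and haveClass("gorgeous")

// Keeps only the elements that satisfy the combined rule.
def select(elements: Seq[Element]): Seq[Element] = elements.filter(rule)
```

Because rules are plain functions, further combinators (for example an `or` or a negation) could be added in the same way without changing the existing ones.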
Example 3: More Advanced
This snippet uses the DSL in a more advanced and less "pure" form, fully leveraging the advantages of an internal implementation.
This program crawls the URL "https://www.example.com/", recursively visiting all links that are relative URLs and whose text ends with "example".
For each page crawled, the scraping focuses on HTML elements whose parent's text is not empty and that contain a non-empty "alt" attribute.
The export process outputs key-value pairs to a file named test.json in JSON format, where the keys are the HTML tags of the scraped elements and the values are the number of occurrences of each tag.
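The tag-counting step described above can be illustrated in plain Scala, independently of the DSL. This is only a sketch of the computation, not the DSL program itself; the `countTags` helper and its input shape (one tag name per scraped element) are assumptions made for this example.

```scala
// Hypothetical input: the tag name of each scraped element, one entry per element.
// Groups equal tags together and sums one occurrence per element.
def countTags(tags: Seq[String]): Map[String, Int] =
  tags.groupMapReduce(identity)(_ => 1)(_ + _)

val counts = countTags(Seq("img", "a", "img"))
```

Here `counts` maps `"img"` to 2 and `"a"` to 1, which corresponds to the key-value pairs that would be serialized to JSON.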