
Application Lifecycle

The Scooby application follows a specific sequence of steps that define its lifecycle. Understanding these steps is crucial for interpreting the results and effectively using this library.

Below is a diagram illustrating the general structure:

[Lifecycle sequence diagram — participants: Main, Scooby, Coordinator, Crawler, Exporter]

  1. Main runs the start callback and spawns the Scooby actor.

  2. Scooby spawns the Coordinator actor, which sets up robots.txt.

  3. Scooby spawns the Exporter actors and the Root Crawler actor, then idles while crawling and export take place.

  4. Scooby signals the end of execution to all Exporters and idles until they reply.

  5. Scooby terminates the actor system, and Main runs the end callback.

Key points to note from this diagram are:

  1. The robots.txt file must be handled before any other step.

  2. The system waits only for the Root Crawler to finish, ignoring other crawlers.

  3. The system must wait for all Exporters to complete their execution.

While most steps are independent of each other, these three steps are blocking and must be completed before proceeding. Managing these dependencies can be challenging, especially in an asynchronous Actor system.

Robots.txt

Checking the robots.txt file is essential to comply with the Robot Exclusion Protocol. This check is performed by the Coordinator actor, and we must ensure that URL exclusion rules are applied before any Crawler requests URL validation. Thus, checking robots.txt is a blocking step that must be completed before any Crawler is spawned and begins crawling.
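
As an illustration, the sketch below gates Crawler creation on the robots.txt check using the Akka Typed ask pattern. The SetupRobots and RobotsReady messages (and the stubbed parsing) are hypothetical, not Scooby's actual API:

```scala
import akka.actor.typed.{ActorRef, ActorSystem, Behavior}
import akka.actor.typed.scaladsl.AskPattern._
import akka.actor.typed.scaladsl.Behaviors
import akka.util.Timeout

import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

// Hypothetical messages for this sketch.
final case class RobotsReady(disallowed: Set[String])
final case class SetupRobots(siteUrl: String, replyTo: ActorRef[RobotsReady])

// Coordinator stub: it would fetch and parse robots.txt, then reply.
def coordinator(): Behavior[SetupRobots] = Behaviors.receiveMessage {
  case SetupRobots(_, replyTo) =>
    replyTo ! RobotsReady(Set("/private")) // stand-in for the parsed Disallow rules
    Behaviors.same
}

@main def robotsGate(): Unit = {
  implicit val system: ActorSystem[SetupRobots] = ActorSystem(coordinator(), "coordinator")
  implicit val timeout: Timeout = 5.seconds

  // Ask the Coordinator and block until the exclusion rules are in place...
  val ready: Future[RobotsReady] = system.ask(SetupRobots("https://example.com", _))
  val rules = Await.result(ready, timeout.duration)

  // ...and only then spawn the Root Crawler (omitted here).
  println(s"Disallowed paths: ${rules.disallowed}")
  system.terminate()
}
```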

For more details, see the Coordinator section.

Crawler Tree

The system is designed so that a Crawler can finish its execution under one of four conditions:

  • The crawled page contains no valid links to explore.

  • It is a leaf Crawler (i.e., it has reached the maximum crawling depth).

  • All its child Crawlers have finished.

  • An error occurred (e.g., a network error).

This results in a tree of Crawlers exploring website pages. When the Root Crawler finishes, all other Crawlers will have finished as well, indicating the end of the crawling process.
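
The following sketch shows how this bottom-up termination can be expressed in Akka Typed by watching child actors; the Crawl message and the fetchLinks stub are hypothetical, not Scooby's actual Crawler:

```scala
import akka.actor.typed.{Behavior, Terminated}
import akka.actor.typed.scaladsl.Behaviors

// Hypothetical message for this sketch.
final case class Crawl(url: String, maxDepth: Int)

def fetchLinks(url: String): List[String] = Nil // stub: extract the page's valid links

def crawler(): Behavior[Crawl] = Behaviors.receive { (context, message) =>
  val links = fetchLinks(message.url)
  if (message.maxDepth == 0 || links.isEmpty)
    // Leaf Crawler or no valid links: finish immediately.
    Behaviors.stopped
  else {
    // Spawn and watch one child per link; watching delivers a
    // Terminated signal when that child stops.
    links.foreach { link =>
      val child = context.spawnAnonymous(crawler())
      context.watch(child)
      child ! Crawl(link, message.maxDepth - 1)
    }
    awaitingChildren(links.size)
  }
}

// Stop only once every child Crawler has terminated; the Root Crawler
// therefore stops last, marking the end of the whole crawling process.
def awaitingChildren(pending: Int): Behavior[Crawl] =
  Behaviors.receiveSignal {
    case (_, Terminated(_)) =>
      if (pending > 1) awaitingChildren(pending - 1) else Behaviors.stopped
  }
```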

For more details, see the Crawler section.

Exporter Termination

Especially for Batch Exporters, being notified of the end of the Crawling phase is crucial for proper job completion. Since the number of Exporters varies, we must wait for all of them to ensure everything concludes correctly; the Akka ask pattern is useful for this purpose. Each Exporter, including Stream Exporters, waits for a specific message (called SignalEnd) to end its execution. Upon receiving this message, Stream Exporters simply terminate, while Batch Exporters perform the final export. In both cases, Exporters reply to the Scooby actor once they are ready to terminate.
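
As a sketch of this protocol, the behaviors below react to SignalEnd as just described; the ExporterDone reply and the String result type are hypothetical simplifications, not Scooby's actual Exporter implementation:

```scala
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// SignalEnd is the message named above; the reply type is hypothetical.
case object ExporterDone
final case class SignalEnd(replyTo: ActorRef[ExporterDone.type])

// A Stream Exporter has already exported each result on arrival,
// so it only acknowledges and stops.
def streamExporter(): Behavior[SignalEnd] = Behaviors.receiveMessage {
  case SignalEnd(replyTo) =>
    replyTo ! ExporterDone
    Behaviors.stopped
}

// A Batch Exporter performs its single, final export before acknowledging.
def batchExporter(accumulated: List[String]): Behavior[SignalEnd] =
  Behaviors.receiveMessage { case SignalEnd(replyTo) =>
    exportAll(accumulated) // stub: write every accumulated result at once
    replyTo ! ExporterDone
    Behaviors.stopped
  }

def exportAll(results: List[String]): Unit = () // placeholder
```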

The termination interaction occurs sequentially between the Scooby actor and each Exporter, ensuring one Exporter finishes before the next begins. This approach is chosen for simplicity and to avoid conflicting export behaviors.
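
Here is a minimal sketch of this sequential round, reusing the hypothetical SignalEnd and ExporterDone messages from the sketch above; chaining the futures with flatMap is what ensures each Exporter is signalled only after the previous one has replied:

```scala
import akka.actor.typed.{ActorRef, ActorSystem}
import akka.actor.typed.scaladsl.AskPattern._
import akka.util.Timeout

import scala.concurrent.{ExecutionContext, Future}

def terminateExporters(exporters: List[ActorRef[SignalEnd]])(implicit
    system: ActorSystem[_],
    timeout: Timeout
): Future[Unit] = {
  implicit val ec: ExecutionContext = system.executionContext
  // foldLeft chains the asks: each ask starts only once the previous
  // Exporter has replied, so the round is strictly sequential.
  exporters.foldLeft(Future.unit) { (previous, exporter) =>
    previous.flatMap(_ => exporter.ask(SignalEnd.apply).map(_ => ()))
  }
}
```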

Detailed interaction between actors

Here you can see a detailed diagram representing the interactions between actors.

[Sequence diagram — participants: User, Scooby, Coordinator, Exporter, Crawler, Scraper, Page]

  1. The User starts the application with start(config) (1.1); Scooby creates the Coordinator (1.2), the Exporters with exporterConfig (1.3), and the Root Crawler with crawlerConfig (1.4).

  2. The Crawler receives crawl(url) (2.1), creates a Scraper with scraperConfig (2.2), and sends it scrape(document) (2.3).

  3. If maxDepth > 0, the Crawler sends checkPage(document) to the Coordinator (2.4) and receives crawlResponse(links) (2.5): if links.size > 0, it creates child Crawlers (2.5.1) and sends them crawl(url) (2.5.2); if links.size == 0, it stops (2.5.3).

  4. If maxDepth == 0, the Crawler stops (2.6).

  5. The Scraper applies its scraping policy via applyPolicy() (2.7) and sends export(results) to the Exporter (2.8).
Last modified: 07 August 2024