Scooby Help

Crawler

A Crawler is the system entity responsible for finding explorable links inside a web page. It interacts with the Coordinator to validate the URLs it finds, and it is responsible for creating Scrapers, which extract data from a page, and new Crawlers, which explore the new URLs.

We can summarize the interaction between the Crawler and the other components with the following:

1. The Crawler receives Crawl(URL) and extracts the link URLs from the page.
2. It sends CheckPages(list of extracted URLs) to the Coordinator, which checks the links.
3. The Coordinator replies with CrawlerCoordinatorResponse(valid links).
4. The Crawler creates a Scraper and sends it Scrape(Document).
5. For each valid link, the Crawler creates a Subcrawler and sends it Crawl(URL).

Each time a Crawler finds a new valid URL, it spawns a new child Crawler that will analyze it. When the analysis of a page is complete, the Crawler signals it to its parent; when a Crawler no longer has any active children, it is removed from the system.
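This child-accounting rule can be sketched outside the actor framework as follows; the names `spawnChild` and `onChildTerminated` are illustrative and are not the project's actual API:

```scala
// Sketch: a crawler tracks its active children and is considered
// removable once every child has signalled completion.
final class CrawlerNode(val url: String):
  private var activeChildren = 0
  private var done = false

  // Spawn a child crawler for a newly validated URL.
  def spawnChild(childUrl: String): CrawlerNode =
    activeChildren += 1
    CrawlerNode(childUrl)

  // Called when a child signals its termination (ChildTerminated()).
  def onChildTerminated(): Unit =
    activeChildren -= 1
    if activeChildren == 0 then done = true // no active children: remove from the system

  def isDone: Boolean = done

@main def demo(): Unit =
  val root  = CrawlerNode("https://example.com")
  val child = root.spawnChild("https://example.com/a")
  root.onChildTerminated()
  println(root.isDone) // true: the only child has terminated
```

In the real system this bookkeeping happens inside the actor: the parent receives a ChildTerminated() message for each finished child rather than a direct method call.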

Structure

The Crawler actor handles the following CrawlerCommand messages:

- Crawl(url: URL)
- CrawlerCoordinatorResponse(result: Iterator[URL])
- ChildTerminated()

The Crawler actor is created with:

- coordinator: ActorRef[CoordinatorCommand]
- exporter: ActorRef[ExporterCommands]
- explorationPolicy: ExplorationPolicy
- clientConfiguration: ClientConfiguration

ClientConfiguration and ExplorationPolicy are case classes used by the Crawler. The Crawler exchanges signals with the Coordinator and creates Scraper actors and child Crawlers.
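The message protocol above can be sketched as a sealed family of commands; note that URL is simplified here to a String alias, which is an assumption for the sake of a self-contained example:

```scala
// Sketch of the Crawler's message protocol as listed above.
// URL is a stand-in alias; the real system uses a proper URL type.
type URL = String

sealed trait CrawlerCommand
// Ask the crawler to explore the page at the given URL.
final case class Crawl(url: URL) extends CrawlerCommand
// The Coordinator's reply carrying the validated links to explore.
final case class CrawlerCoordinatorResponse(result: Iterator[URL]) extends CrawlerCommand
// A child crawler signals that it has finished its exploration.
final case class ChildTerminated() extends CrawlerCommand
```

Modeling the protocol as a sealed trait lets the actor's receive logic pattern-match exhaustively over the three commands.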
Last modified: 07 August 2024