
Scraper

A Scraper is a system entity responsible for analyzing page content to gather structured information. Its work is triggered by the Crawler, which provides the document to scrape. Once the analysis is finished, the Scraper delivers the result to the Exporter and then stops.

The interaction with the other system entities is depicted in the following diagram:

Sequence diagram (Crawler, Scraper, Exporter): the Crawler creates the Scraper with its configuration (Create(scraper configs)) and sends it the document to analyze (Scrape(document)); the Scraper applies its policies to the document and delivers the result to the Exporter (Export(result)).
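
As a reference, the flow above could be sketched with typed actors as follows. This is a minimal sketch assuming an Akka Typed style API; ScrapeDocument, ScraperPolicy, ScraperCommands and ExporterCommands below are simplified stand-ins inferred from the diagrams, not Scooby's actual declarations.

import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// Hypothetical stand-ins for types defined elsewhere in Scooby; the real
// message protocols and policy type may differ.
trait ScrapeDocument
type ScraperPolicy[T] = ScrapeDocument => Iterable[T]

sealed trait ScraperCommands
final case class Scrape(document: ScrapeDocument) extends ScraperCommands

sealed trait ExporterCommands
final case class Export[T](result: Iterable[T]) extends ExporterCommands

object Scraper {
  // Wait for a single Scrape message, apply the policy to the document,
  // send the result to the Exporter, then stop (as in the sequence above).
  def apply[T](exporter: ActorRef[ExporterCommands],
               policy: ScraperPolicy[T]): Behavior[ScraperCommands] =
    Behaviors.receiveMessage {
      case Scrape(document) =>
        exporter ! Export(policy(document))
        Behaviors.stopped
    }
}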

Structure

Class diagram of the Scraper and the entities it collaborates with:

- Crawler («Actor»): exporter: ActorRef[ExporterCommands], scraper: ActorRef[ScraperCommands], scraperPolicy: ScraperPolicy[T]. It creates the Scraper and signals it with ScraperCommand messages.
- ScraperCommand: Scrape(document: ScrapeDocument).
- ScrapeDocument: the document to analyze, exposing find(regExp: String): Seq[String], group(toGroup: Iterator[Regex.Match]): Seq[String], frontier(): Seq[URL], getAllLinkOccurrences(): Seq[URL], parseDocument(using parser: Parser[HTMLDom]): HTMLDom, select(selectors: String*): Seq[HTMLElement], getElementById(id: String): Option[HTMLElement], getElementsByTag(tag: String): Seq[HTMLElement], getElementsByClass(className: String): Seq[HTMLElement], getAllElements(): Seq[HTMLElement].
- ScraperPolicy[T]: the transformation used by the Scraper to extract results of type T.
- Scraper («Actor»): exporter: ActorRef[ExporterCommands], policy: ScraperPolicy[T], scrape(document: ScrapeDocument): Iterable[T]. It uses the ScrapeDocument and the ScraperPolicy and signals the Exporter with the result.
- Exporter («Actor»): export(result: Iterable[T]): Unit.
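
The «creates» and «signal» relationships between Crawler and Scraper can be pictured with the following sketch, which reuses the stand-in definitions from the previous snippet. CrawlerCommands and Crawl are hypothetical placeholders for the Crawler's own protocol, which is documented on its own page.

import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

sealed trait CrawlerCommands
final case class Crawl(document: ScrapeDocument) extends CrawlerCommands

object Crawler {
  def apply[T](exporter: ActorRef[ExporterCommands],
               scraperPolicy: ScraperPolicy[T]): Behavior[CrawlerCommands] =
    Behaviors.receive { (context, message) =>
      message match {
        case Crawl(document) =>
          // «creates»: spawn a Scraper configured with the policy...
          val scraper = context.spawnAnonymous(Scraper(exporter, scraperPolicy))
          // «signal»: ...and hand it the document to analyze.
          scraper ! Scrape(document)
          Behaviors.same
      }
    }
}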

Scraper Policy

A Scraper Policy is the transformation that the Scraper applies to the page provided by the Crawler in order to gather structured information, which is then delivered to the Exporter.
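
As an illustration, assuming a policy is a function from the scraped document to the gathered results, two simple policies could look like the following sketch. Only the two ScrapeDocument methods used here are stubbed (the full API is listed in the Structure section), and the regular expression is purely illustrative.

import java.net.URL

// Stand-in for Scooby's ScrapeDocument, reduced to the methods used below.
trait ScrapeDocument {
  def getAllLinkOccurrences(): Seq[URL]
  def find(regExp: String): Seq[String]
}

// Assumed shape of a policy: document in, structured results out.
type ScraperPolicy[T] = ScrapeDocument => Iterable[T]

// Gather every link occurring on the page.
val linkPolicy: ScraperPolicy[URL] =
  document => document.getAllLinkOccurrences()

// Gather every e-mail address matched in the page content.
val emailPolicy: ScraperPolicy[String] =
  document => document.find("""[\w.+-]+@[\w-]+\.[\w.]+""")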
