Scooby Help

Luca Rubboli

From an implementation point of view, I mainly managed:

  • Scraper;

  • Scraper's Result;

  • DSL Exporter section;

  • Test suites related to the previous topics.

Further implementation details for the most relevant parts are described in the following sections.

Scraper

The Scraper has been implemented starting from a standard class design; once a working version was available, it was enriched with its actor extension.

Given proper implementations of the result class and the scraper policy, the Scraper itself turns out to be quite simple. Once set up, it manages the scraping of a single document: after applying the policy, it signals the Exporter component with the obtained result and stops.

/**
 * Class representing Scraper actor.
 *
 * @param exporterRouter the exporter router [[ActorRef]] where to redirect the scraping results
 * @param scrapePolicy the scraping policy used by the actor.
 * @tparam T type representing the [[DataResult]] type.
 */
class Scraper[T](exporterRouter: ActorRef[ExporterCommands], scrapePolicy: ScraperPolicy[T]):

  /**
   * Defines [[Scraper]]'s [[Behavior]].
   * @return the [[Behavior]]
   */
  def idle(): Behavior[ScraperCommands] =
    Behaviors.setup: context =>
      Behaviors.receiveMessage:
        case ScraperCommands.Scrape(doc: ScrapeDocument) =>
          Try:
            val res = resultFromPolicy(doc)
            exporterRouter ! ExporterCommands.Export(res)
          .fold(e => println(s"An error occurred while scraping: $e"), identity)
          Behaviors.stopped

  private def resultFromPolicy(argument: ScrapeDocument): Result[T] =
    Result(scrapePolicy(argument))

Scraper Policies

The design phase involving scraper policies aimed at a modular, extensible, and customizable implementation. To keep the policy general yet effective, it consists of a function that maps a document (the so-called ScrapeDocument, which exposes a useful scraping API) to an iterable of a generic type.

/**
 * A type representing a function that extracts an [[Iterable]] used
 * to build [[Result]] from a [[Document]].
 */
type ScraperPolicy[T] = ScrapeDocument => Iterable[T]
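As an example, a custom policy can be written as a plain function. The following sketch is hypothetical (the policy name is mine, and it relies only on the ScrapeDocument API used in the utility shown below): it collects the outer HTML of all anchor elements of a document.

// Hypothetical example of a custom policy: collect the outer HTML of all
// anchor elements in a document.
val anchorsPolicy: ScraperPolicy[String] =
  doc => doc.getElementsByTag("a").map(_.outerHtml)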

During the test phase, some common policies emerged, so I decided to provide an easier way to access them; the same approach was also followed when defining the DSL keywords.

/**
 * Utility for [[ScraperPolicy]] based on selectBy attribute, given selectors specified.
 * Admissible values for selectBy are id, tag, class, css and regex.
 *
 * @param selectors a [[Seq]] of selectors used in scraper policy.
 * @param selectBy a selector to specify the policy.
 * @return the selected policy with specified selectors.
 */
def scraperPolicy(selectors: Seq[String], selectBy: String): ScraperPolicy[String] =
  (scraper: ScrapeDocument) =>
    selectBy match
      case "id" =>
        selectors.map(scraper.getElementById).map(_.fold("")(_.outerHtml)).filter(_.nonEmpty)
      case "tag" =>
        selectors.flatMap(scraper.getElementsByTag).map(_.outerHtml)
      case "class" =>
        selectors.flatMap(scraper.getElementsByClass).map(_.outerHtml)
      case "css" =>
        selectors.flatMap(scraper.select(_)).map(_.outerHtml)
      case "regex" =>
        selectors.flatMap(scraper.find)
      case _ =>
        throw Error(s"Not yet implemented policy by $selectBy")

Moreover, following this approach, an extension method to concatenate policies has been added.
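The actual extension method is not reported here; a minimal sketch of how such a concatenation could look is the following (the method name `and` is an assumption, not necessarily the name used in Scooby):

// Hypothetical sketch of a policy-concatenation extension method; the actual
// name and signature in Scooby may differ.
extension [T](policy: ScraperPolicy[T])
  infix def and(other: ScraperPolicy[T]): ScraperPolicy[T] =
    doc => policy(doc) ++ other(doc)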

Result

Result has been incrementally implemented, starting from a simple container and then enriched with aggregation techniques to promote the Exporter's scaling. In detail, it has been implemented as a case class, promoting immutability. Due to Exporter requirements on both Batch and Streaming aggregation strategies, both kinds of updates have been provided, as well as an aggregation method that allows the usage of multiple Exporters.

/**
 * Class representing [[Scraper]]'s results implementation.
 *
 * @param data representing actual result.
 * @tparam T representing result's type.
 */
final case class Result[T](data: Iterable[T]) extends DataResult[T]:

  override def updateStream(data: T)(using aggregator: ItAggregator[T]): Result[T] =
    Result(aggregator.aggregateStream(this.data, data))

  override def updateBatch(data: Iterable[T])(using aggregator: ItAggregator[T]): Result[T] =
    Result(aggregator.aggregateBatch(this.data, data))

  override def aggregate[A <: DataResult[T]](result: A)(using aggregator: ItAggregator[T]): Result[T] =
    updateBatch(result.data)

Once again, as a matter of usability, standard Iterable aggregators have been provided.
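The shape of ItAggregator can be inferred from its usage in Result above; under that assumption, a minimal sketch of a standard aggregator that simply appends incoming data could look like this (both the trait definition and the given's name are inferred, not taken from the actual codebase):

// ItAggregator's shape is inferred from its usage in Result and may differ
// from the actual Scooby definition.
trait ItAggregator[T]:
  def aggregateStream(current: Iterable[T], item: T): Iterable[T]
  def aggregateBatch(current: Iterable[T], batch: Iterable[T]): Iterable[T]

// A sketch of a standard aggregator that simply appends incoming data.
given seqAggregator[T]: ItAggregator[T] with
  def aggregateStream(current: Iterable[T], item: T): Iterable[T] = current ++ Seq(item)
  def aggregateBatch(current: Iterable[T], batch: Iterable[T]): Iterable[T] = current ++ batch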

DSL

On the DSL side, I managed the Exporter entity, defining its keywords and their implementation.

An example of Exporter configuration with DSL is:

exports:
  batch:
    strategy:
      results get(el => (el.tag, el.text)) output: toFile("test.txt") withFormat json
    aggregate:
      _ ++ _
  streaming:
    results get tag output: toConsole withFormat text

To give a brief taste of the DSL implementation, I report the following snippet of code:

/**
 * Build the [[Exporter]] batch context.
 * @param context the [[StrategiesContext]] containing exporting strategies.
 * @tparam T the [[Result]]'s type.
 * @return the [[BatchExportationContext]] built.
 */
def batch[T](using context: StrategiesContext[T]): BatchExportationContext[T] =
  BatchExportationContext[T](context)

/**
 * Type alias representing the "Batch" section under the "exports" part of the DSL.
 * @tparam T type of results returned by the scraping.
 */
private type BatchDefinitionScope[T] = BatchSettingContext[T] ?=> Unit

/**
 * The exporter batch technique's context.
 * @param context the context used to set the [[BatchExporting]] configuration.
 * @tparam T the [[Result]]'s type.
 */
case class BatchExportationContext[T](context: StrategiesContext[T]):

  /**
   * Builder used to set the [[BatchExporting]] configuration.
   * @param block the function used to set the [[BatchExporting]] configuration.
   */
  inline infix def apply(block: BatchDefinitionScope[T]): Unit =
    catchRecursiveCtx[BatchSettingContext[?]]("batch")
    visitCtxUnsafe(block)

  /**
   * Unsafe version of [[BatchExportationContext.apply]].
   * @param block the function used to set the [[BatchExporting]] configuration.
   */
  private def visitCtxUnsafe(block: BatchDefinitionScope[T]): Unit =
    given batchStrategyContext: BatchSettingContext[T] = BatchSettingContext[T](
      ExportingBehaviors.writeOnConsole(Formats.string),
      AggregationBehaviors.default)
    block
    context.exportingStrategies ++= Seq(BatchExporting(
      batchStrategyContext.policy,
      batchStrategyContext.aggregation
    ))

By means of the BatchExportationContext case class, a BatchSettingContext is built (after checking that one has not been built previously, failing otherwise) and filled with the default writeOnConsole behavior and the default aggregation. After that, the BatchDefinitionScope is consumed: it applies the policies and aggregation functions defined by the user to batchStrategyContext, which is then exported into the global context configuration.

It is worth mentioning that multiple batch and streaming strategies can be specified, in which case each of them is executed; assertions on the structure of the DSL configuration are still enforced, so that, for example, it is not possible to define an export block inside an already defined one.
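As an illustration of how such a nesting check can be realized, catchRecursiveCtx could summon the context type at compile time and fail when it is already in scope; the following is only a hypothetical sketch of that mechanism, not the actual Scooby implementation:

import scala.compiletime.summonFrom

// Hypothetical sketch: fail if a context of type C is already in the given
// scope, i.e. the corresponding keyword is nested inside itself.
inline def catchRecursiveCtx[C](keyword: String): Unit =
  summonFrom {
    case _: C => throw IllegalStateException(s"Keyword '$keyword' cannot be used inside an already defined block")
    case _    => ()
  }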

Tests

All feature implementations have been preceded by a robust test phase. In detail, all my main implementations have been tested with the ScalaTest suite. Beyond my own implementation, I also tested the application with the standard configuration against the one configured using the DSL syntax:

"Application with DSL configuration and standard configurations" should "obtain the same result" in : val appDSL = scooby: config: network: Timeout is timeout MaxRequests is maxRequest headers: auth._1 to auth._2 options: MaxDepth is maxDepth MaxLinks is maxLinks crawl: url: this.url policy: linksPolicy scrape: scrapePolicyDSL(scrapeToIter) exports: batch: strategy: batchStrategyDSL(filePathDSL.toString) aggregate: batchAggregation val appStandard = ScoobyRunnable( Configuration( CrawlerConfiguration( URL(url), ExplorationPolicies.allLinks, maxDepth, ClientConfiguration(timeout, maxRequest, Map(auth)) ), ScraperConfiguration(scrapeToIter), ExporterConfiguration(Seq( BatchExporting( (res: Result[HTMLElement]) => batchStrategy(filePathStandard.toString)(res.data), (res1: Result[HTMLElement], res2: Result[HTMLElement]) => Result(batchAggregation(res1.data, res2.data)) ))), CoordinatorConfiguration(maxLinks) ) ) val resultDSL: Map[String, Int] = resultsToCheckAsMapWithSize( Await.result(appDSL.run(), 10.seconds), _.tag ) val resultStandard: Map[String, Int] = resultsToCheckAsMapWithSize( Await.result(appStandard.run(), 10.seconds), _.tag ) resultDSL shouldBe resultStandard

Cucumber features

The domain entity Scraper has also been tested using a Cucumber test suite, by defining features and their step implementations.

Feature: Scraper data filtering

  Scenario: No matching after data filtering
    Given I have a scraper with a proper configuration
    And I have a document with no matching
    When The scraper applies the policy
    Then It should send an empty result

Given("""I have a scraper with a proper configuration"""): () =>
  val selectors: Seq[String] = Seq("li", "p")
  scraperActor = testKit.spawn(Scraper(exporterProbe.ref, ScraperPolicies.scraperPolicy(selectors, "tag")))

And("""I have a document with no matching"""): () =>
  docContent =
    s"""
       |<html lang="en">
       |<head>
       |  <title>Basic HTML Document</title>
       |</head>
       |</html>
       |""".stripMargin
  docUrl = URL.empty
  scrapeDocument = ScrapeDocument(docContent, docUrl)
  result = Result.empty[String]

When("""The scraper applies the policy"""): () =>
  scraperActor ! ScraperCommands.Scrape(scrapeDocument)

Then("""It should send an empty result"""): () =>
  exporterProbe.expectMessage(ExporterCommands.Export(result))
Last modified: 07 August 2024