Scooby Help

Get started

To start using Scooby in a new SBT project, you need to manually add the library.

  • Generate a new project using SBT.

  • Download the JAR from the latest release of the Scooby library.

  • Create a new lib folder inside your SBT project.

  • Place the downloaded JAR inside the lib folder you've just created.

  • Create a class that extends either org.unibo.scooby.dsl.ScoobyEmbeddable or org.unibo.scooby.dsl.ScoobyApplication.
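No extra dependency declaration is needed for the JAR: SBT automatically puts any JAR found in the lib folder on the classpath as an unmanaged dependency. A minimal build.sbt could therefore look like the sketch below (the project name and Scala version are placeholders; the DSL relies on Scala 3 syntax).

// build.sbt -- minimal sketch; the Scooby JAR placed in lib is picked up automatically
ThisBuild / scalaVersion := "3.3.1"

lazy val root = (project in file("."))
  .settings(
    name := "my-scooby-project"
  )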

ScoobyEmbeddable is a Scala trait that can be mixed into a class to use the Scooby DSL without making the class executable; the scooby keyword then returns a Future containing the result of the scraping. ScoobyApplication, on the other hand, can be extended by a Scala object to make it directly executable.

Here's the difference in their usage:

class MyClass extends ScoobyEmbeddable:
  val app: ScoobyRunnable[?] = scooby:
    ...

  val result: Result[?] = Await.result(app.run(), Duration.Inf)

object Application extends ScoobyApplication:
  scooby:
    ...
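Since app.run() returns an ordinary Scala Future, the result can also be consumed without blocking. The following is a minimal sketch under that assumption; the callback bodies are purely illustrative.

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Non-blocking alternative to Await.result: react once the scraping completes.
app.run().onComplete {
  case Success(result) => println(s"Scraping finished with result: $result")
  case Failure(error)  => println(s"Scraping failed: ${error.getMessage}")
}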

Here's instead a full example of the usage with ScoobyApplication.

import org.unibo.scooby.dsl.ScoobyApplication
import scala.concurrent.duration.DurationInt

object MyObject extends ScoobyApplication:
  scooby:
    config:
      network:
        Timeout is 9.seconds
        MaxRequests is 10
        headers:
          "User-Agent" to "Scooby/1.0-alpha (https://github.com/PPS-22-Scooby/PPS-22-Scooby)"
      options:
        MaxDepth is 2
        MaxLinks is 20

    crawl:
      url:
        "https://www.myTestUrl.com"
      policy:
        hyperlinks not external

    scrape:
      elements

    exports:
      batch:
        strategy:
          results get(el => (el.tag, el.text)) output:
            toFile("test.json") withFormat json
        aggregate:
          _ ++ _

      streaming:
        results get tag output:
          toConsole withFormat text

Customization

The provided DSL is open to customization; here is a brief introduction to the possible configurations.

Network

To be able to visit websites that require user authentication, it is possible to define multiple headers in the headers section:

headers: "my-header-name-1" to "my-header-value-1" "my-header-name-2" to "my-header-value-2"

Crawler

It is possible to define custom policies, which must adhere to type CrawlDocument ?=> Iterable[URL]. An example could be:

policy:
  allLinks not external
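Since any expression of type CrawlDocument ?=> Iterable[URL] is a valid policy, plain Scala collection operations can be used as well. The sketch below assumes that the hyperlinks keyword exposes the document's outgoing links as a regular Iterable[URL]; the domain check is an arbitrary illustrative rule.

policy:
  // Illustrative custom policy: keep only links whose URL mentions "example.com".
  // Assumes `hyperlinks` yields a plain Iterable[URL], as in the example above.
  hyperlinks.filter(url => url.toString.contains("example.com"))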

Scraper

It is possible to define custom policies, which must adhere to type ScrapeDocument ?=> Iterable[T]. It is also possible to mix policies using boolean filter conditions. An example could be:

scrape:
  elements that:
    haveAttributeValue("href", "level1.1.html") and haveClass("amet") or followRule { element.id == "ipsum" }
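Likewise, because any expression of type ScrapeDocument ?=> Iterable[T] works as a scraping policy, ordinary collection operations can be mixed with the DSL keywords. The sketch below assumes that elements yields a regular iterable of HTML elements exposing the tag accessor used elsewhere on this page.

scrape:
  // Illustrative custom policy: keep only anchor elements.
  // Assumes `elements` is a plain iterable whose items expose `tag`.
  elements.filter(el => el.tag == "a")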

Exporter

It is possible to define both batch and streaming strategies, even multiple times, concatenating their effects. An example could be:

exports:
  batch:
    strategy:
      results get(el => (el.tag, el.text)) output:
        toFile("testJson.txt") withFormat json
    aggregate:
      _ ++ _

  batch:
    strategy:
      results get(el => (el.tag, el.text)) output:
        toFile("testText.txt") withFormat text
    aggregate:
      _ ++ _

  streaming:
    results get tag output:
      toConsole withFormat text

When the output is configured with toFile, it is possible to choose the preferred file action between Append (appends results to the existing content of the file) and Overwrite (deletes the previous content of the file). The default behavior, if not specified, is Overwrite.

exports:
  batch:
    strategy:
      results get(el => (el.tag, el.text)) output:
        toFile("testJson.txt", Append) withFormat json
    aggregate:
      _ ++ _

  batch:
    strategy:
      results get(el => (el.tag, el.text)) output:
        toFile("testText.txt", Overwrite) withFormat text
    aggregate:
      _ ++ _

  streaming:
    results get tag output:
      toConsole withFormat text