XML Streaming library for Scala (xs4s)

Capabilities

xs4s enables the processing of large (multi-gigabyte) XML files in Scala, for example .xml.gz files straight from Wikipedia (example below), without running out of memory.

In terms of specific features, xs4s offers:

  • Scala-friendly utilities around the javax.xml.stream.events API.
  • A mapping from StAX events to scala.xml.Elem and other Scala XML classes.
  • An alternative to scala.xml.XML.load() for parsing XML, for example assert(xs4s.XML.loadString("<test/>") == <test/>).
  • An integration with FS2 and ZIO for pure FP streaming.
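To give a feel for what "Scala-friendly utilities" around StAX can mean, here is a minimal sketch that adapts a javax.xml.stream.XMLEventReader to a Scala Iterator using only the JDK. The names StaxIteratorSketch, events and startElementNames are hypothetical and are not part of the xs4s API.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}
import javax.xml.stream.events.XMLEvent

// Hypothetical helper for illustration only; not the xs4s API.
object StaxIteratorSketch {

  /** Adapts a StAX reader to a Scala Iterator so that map/filter/collect work on it. */
  def events(reader: XMLEventReader): Iterator[XMLEvent] =
    new Iterator[XMLEvent] {
      def hasNext: Boolean = reader.hasNext
      def next(): XMLEvent = reader.nextEvent()
    }

  /** Lists the local names of all start elements, in document order. */
  def startElementNames(xml: String): List[String] = {
    val reader =
      XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    try {
      events(reader).collect {
        case e if e.isStartElement => e.asStartElement().getName.getLocalPart
      }.toList
    } finally reader.close()
  }
}
```

Events are pulled from the reader one at a time, so only the current event needs to be in memory while iterating.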

Release notes / change log

Version  Date        Changes
v0.9.1   2021-07-27  Scala 3.0.1 support; FS2 v3 (cats-effect 3) support; FS2 performance improvement
v0.8.7   2021-03-16  Scala 3.0.0-RC1 support
v0.8.5   2020-12-07  Latest ScalaTest
v0.8.0   2020-07-02  ZIO support; latest FS2
v0.7.0   2020-05-16  FS2 support; latest libraries; Scala 2.13 support
v0.5     2017-12-07  Cross-compile to both Scala 2.12 and 2.11
v0.4     2017-10-05  Update to Scala 2.12
v0.3     2016-09-19  Upgrades to many examples and slimming down API
v0.2     2016-04-05  Simplify code
v0.1     2015-02-03  Initial release

How it does it

It uses the standard XML streaming API, StAX (https://github.com/FasterXML/woodstox), as a back-end. As events arrive, it gradually builds a partial tree; when a user-supplied function (the "query") matches, it materialises that partial tree into a full tree, which it then returns to the user.
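The idea can be sketched with the JDK's StAX API alone. This is not the xs4s implementation: the Node type stands in for scala.xml.Elem, the "query" is simplified to a predicate on the element name, and PartialTreeSketch and extract are hypothetical names. For brevity the results are collected into a List, whereas a streaming version would hand out each tree as soon as it is complete.

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.events.XMLEvent
import scala.collection.mutable.ListBuffer

// Illustrative sketch only; names and types are assumptions, not xs4s code.
object PartialTreeSketch {

  /** Minimal tree type standing in for scala.xml.Elem. */
  final case class Node(name: String, text: String, children: List[Node])

  /** Mutable builder for an element that has been opened but not yet closed. */
  private final class Open(val name: String) {
    val text = new StringBuilder
    val children = ListBuffer.empty[Node]
    def toNode: Node = Node(name, text.toString, children.toList)
  }

  /** Returns a full tree for every element whose name satisfies `query`.
    * Anything outside a matching element is discarded as soon as it is read. */
  def extract(xml: String)(query: String => Boolean): List[Node] = {
    val reader =
      XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    val results = ListBuffer.empty[Node]
    var open: List[Open] = Nil // stack of open elements inside the current match
    try {
      while (reader.hasNext) {
        val event: XMLEvent = reader.nextEvent()
        if (event.isStartElement) {
          val name = event.asStartElement().getName.getLocalPart
          // Start capturing on a match, or keep capturing if already inside one.
          if (open.nonEmpty || query(name)) open = new Open(name) :: open
        } else if (event.isCharacters && open.nonEmpty) {
          open.head.text.append(event.asCharacters().getData)
        } else if (event.isEndElement && open.nonEmpty) {
          val finished = open.head.toNode
          open = open.tail
          open.headOption match {
            case Some(parent) => parent.children += finished // still inside the match
            case None         => results += finished         // match complete: emit it
          }
        }
      }
    } finally reader.close()
    results.toList
  }
}
```

Memory use is bounded by the size of the largest matching element rather than by the size of the whole document, which is what makes multi-gigabyte inputs workable.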

Getting started

Add the core module and whichever integration module you need to your build.sbt (compatible with Scala 3.0.1 and the Scala 2.13 and 2.12 series):

libraryDependencies += "com.scalawilliam" %% "xs4s-core" % "0.9.1"
// FS2 integration for cats-effect 2
libraryDependencies += "com.scalawilliam" %% "xs4s-fs2" % "0.9.1"
// FS2 integration for cats-effect 3
libraryDependencies += "com.scalawilliam" %% "xs4s-fs2v3" % "0.9.1"
// ZIO integration
libraryDependencies += "com.scalawilliam" %% "xs4s-zio" % "0.9.1"

Examples

FS2 Streaming

Then, you can implement functions such as the following (BriefFS2Example - note the explicit types are for clarity):

/**
  * @param byteStream Could be, for example, fs2.io.readInputStream(inputStream)
  */
def extractAnchorTexts(byteStream: Stream[IO, Byte]): Stream[IO, String] = {

  /** extract all elements called 'anchor' **/
  val anchorElementExtractor: XmlElementExtractor[Elem] =
    XmlElementExtractor.filterElementsByName("anchor")

  /** Turn into XMLEvent */
  val xmlEventStream: Stream[IO, XMLEvent] =
    byteStream.through(byteStreamToXmlEventStream())

  /** Collect all the anchors as [[scala.xml.Elem]] */
  val anchorElements: Stream[IO, Elem] =
    xmlEventStream.through(anchorElementExtractor.toFs2PipeThrowError)


  /** And finally extract the text contents for each Elem */
  anchorElements.map(_.text)
}

ZIO Streaming

Then, you can implement functions such as the following (BriefZIOExample - note the explicit types are for clarity):

/**
  * @param byteStream Could be, for example, zio.stream.Stream.fromInputStream(inputStream)
  * @return the text content of each 'anchor' element
  */
def extractAnchorTexts[R <: Blocking](byteStream: ZStream[R, IOException, Byte]):
                                                     ZStream[R, Throwable, String] = {
  /** extract all elements called 'anchor' **/
  val anchorElementExtractor: XmlElementExtractor[Elem] =
    XmlElementExtractor.filterElementsByName("anchor")

  /** Turn into XMLEvent */
  val xmlEventStream: ZStream[R, Throwable, XMLEvent] =
    byteStream.via(byteStreamToXmlEventStream()(_))

  /** Collect all the anchors as [[scala.xml.Elem]] */
  val anchorElements: ZStream[R, Throwable, Elem] =
    xmlEventStream.via(anchorElementExtractor.toZIOPipeThrowError)

  /** And finally extract the text contents for each Elem */
  anchorElements.map(_.text)
}

Plain Iterator streaming

Alternatively, there is a plain-Scala API, which is useful when you interact with legacy Java code or are not yet comfortable with pure FP (BriefPlainScalaExample):

def extractAnchorTexts(sourceFile: File): Unit = {
  val anchorElementExtractor: XmlElementExtractor[Elem] =
    XmlElementExtractor.filterElementsByName("anchor")
  val xmlEventReader = XMLStream.fromFile(sourceFile)
  try {
    val elements: Iterator[Elem] =
      xmlEventReader.extractWith(anchorElementExtractor)
    val text: Iterator[String] = elements.map(_.text)
    text.foreach(println)
  } finally xmlEventReader.close()
}

Advanced Wikipedia example

This example counts how often each anchor occurs in Wikipedia's abstract documents.

It exercises several features at once. The main example is in FindMostPopularWikipediaKeywordsFs2App or FindMostPopularWikipediaKeywordsZIOApp. There is also a plain Scala example (using Iterator) in FindMostPopularWikipediaKeywordsPlainScalaApp.

$ git clone https://github.com/ScalaWilliam/xs4s.git
$ cd xs4s
$ sbt "examples/runMain xs4s.example.FindMostPopularWikipediaKeywordsFs2App"
$ sbt "examples/runMain xs4s.example.FindMostPopularWikipediaKeywordsZIOApp"
$ sbt "examples/runMain xs4s.example.FindMostPopularWikipediaKeywordsPlainScalaApp"

The same code handles a 100 MB file or a 4 GB file without running out of memory, and it does so quickly. It converts XML streams into Scala XML trees on demand, which you can then query.
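The counting step of such an example can be sketched in plain Scala, assuming the anchor texts have already been extracted as an Iterator[String]. This is an illustration under that assumption, not the code of the FindMostPopularWikipediaKeywords apps; KeywordCountSketch, countKeywords and mostPopular are hypothetical names.

```scala
// Illustrative sketch only; not taken from the xs4s examples.
object KeywordCountSketch {

  /** Consumes the iterator once; only the counts (one entry per distinct
    * keyword) are kept in memory, not the keywords stream itself. */
  def countKeywords(keywords: Iterator[String]): Map[String, Int] =
    keywords.foldLeft(Map.empty[String, Int]) { (counts, keyword) =>
      counts.updated(keyword, counts.getOrElse(keyword, 0) + 1)
    }

  /** The `n` most frequent keywords, most popular first; ties are broken alphabetically. */
  def mostPopular(keywords: Iterator[String], n: Int): List[(String, Int)] =
    countKeywords(keywords).toList
      .sortBy { case (keyword, count) => (-count, keyword) }
      .take(n)
}
```

Because the fold sees one keyword at a time, the input can be as large as the underlying XML stream allows; memory grows with the number of distinct keywords only.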