Scala has very good support for XML, including a built-in XML parser and access to the platform-provided parsers.

Sometimes the documents that need to be parsed are in HTML, and XML parsers cannot handle that, since HTML's rules for element nesting and attribute values are different. The TagSoup package by John Cowan is designed to bridge the gap. The question is: how do you hook TagSoup into Scala’s XML parsing?

A Google search came back with two relevant results: “How to use TagSoup with Scala XML?” and “Processing real world HTML as if it were XML in scala”, both by Florian Hars.

Unless I am missing something, at least in Scala 2.8 there is a simpler solution:

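import scala.xml.{Elem, XML}
import scala.xml.factory.XMLLoader

import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

object TagSoupXmlLoader {

  // TagSoup's JAXP factory; the SAX parsers it creates tolerate malformed HTML
  private val factory = new SAXFactoryImpl()

  // Returns a Scala XML loader backed by a fresh TagSoup parser
  def get(): XMLLoader[Elem] = {
    XML.withSAXParser(factory.newSAXParser())
  }
}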
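
To show how the loader might be used, here is a minimal sketch that parses a sloppy HTML string and pulls out the link targets. The `TagSoupUsage` object, the `extractLinks` helper and the sample markup are made up for illustration; only `XML.withSAXParser`, `loadString` and the `\\` and `\` selectors come from the Scala XML library, and the TagSoup jar is assumed to be on the classpath.

```scala
import scala.xml.{Elem, XML}

import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

// Hypothetical helper object, not part of TagSoup or Scala XML
object TagSoupUsage {

  // Parses (possibly malformed) HTML and returns the href of every <a> element
  def extractLinks(html: String): Seq[String] = {
    // A new parser is created for each call, as in TagSoupXmlLoader.get()
    val loader = XML.withSAXParser(new SAXFactoryImpl().newSAXParser())
    val doc: Elem = loader.loadString(html)
    // \\ searches all descendants by label; \ "@href" selects the attribute
    (doc \\ "a").map(a => (a \ "@href").text)
  }

  def main(args: Array[String]): Unit = {
    // Unclosed <p> tags and an unquoted attribute value: not well-formed XML
    val html = "<p>First<p>Second <a href=page.html>link</a>"
    extractLinks(html).foreach(println)
  }
}
```

Once the document is loaded, it is an ordinary `Elem`, so the rest of the Scala XML API applies to it unchanged.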

Strictly speaking, the object is not needed; the one-liner

XML.withSAXParser(new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl().newSAXParser())

is all it takes! Still, the object provides a natural place for code that configures features of the SAXFactoryImpl, if such a need ever arises :)
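
As a rough sketch of what such configuration could look like, the variant below sets one TagSoup feature on the factory before any parser is created. The object name `ConfiguredTagSoupLoader` is hypothetical, and the choice of the `ignore-bogons` feature (which asks TagSoup to drop unknown elements) is only an example; whether any feature needs changing depends on the documents being parsed.

```scala
import scala.xml.{Elem, XML}
import scala.xml.factory.XMLLoader

import org.ccil.cowan.tagsoup.Parser
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

// Hypothetical variant of TagSoupXmlLoader with one feature configured
object ConfiguredTagSoupLoader {

  private val factory = {
    val f = new SAXFactoryImpl()
    // Illustrative choice: ask TagSoup to ignore elements it does not know
    f.setFeature(Parser.ignoreBogonsFeature, true)
    f
  }

  // Parsers created by the factory pick up the features set above
  def get(): XMLLoader[Elem] = {
    XML.withSAXParser(factory.newSAXParser())
  }
}
```

Callers use it exactly like the unconfigured loader, for example `ConfiguredTagSoupLoader.get().loadString(html)`.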

Comments

Pavel Friday, September 24, 2010 6:43:00 AM

Thanks a lot, because “Processing real world HTML as if it were XML in scala” doesn’t even compile as of Scala 2.8.0 :)

sanj sahayam Monday, January 23, 2012 1:51:00 AM

Exactly what I was looking for! Thanks! :)

qu1j0t3 Friday, April 06, 2012 9:11:00 PM

Thank you. This really helped me get a project started. And thank you, John Cowan, for TagSoup.

Siddhartha Saturday, January 05, 2013 5:24:00 AM

Thank you. Very useful.

Leonid Dubinsky Sunday, January 06, 2013 7:48:00 PM

Thanks for your kind words!