Solved: Preventing a timeout error when using XmlDocument.Load with SgmlReader

I was working on a project that used SgmlReader to clean up an invalid web page as I scraped data from it. I was fetching the page with the System.Net.WebRequest class and related functionality.
However, the page declared an XHTML DTD yet contained malformed XHTML, and that caused a timeout error when I tried to load the document into an XmlDocument using the SgmlReader.
Here is my code (the timeout occurred on the doc.Load(sgmlReader) call):
// Requires: using System.IO; using System.Net; using System.Xml; using Sgml;
var req = WebRequest.Create("https://SomeWebsiteWithXhtmlDTDAndMalformedXhtml.com");
using (var res = req.GetResponse())
{
    using (Stream responseStream = res.GetResponseStream())
    {
        using (StreamReader reader = new StreamReader(responseStream))
        {
            using (SgmlReader sgmlReader = new SgmlReader())
            {
                sgmlReader.DocType = "HTML";
                sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
                sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
                sgmlReader.InputStream = reader;
                // create document
                XmlDocument doc = new XmlDocument();
                doc.NameTable.Add("http://www.w3.org/1999/xhtml");
                doc.PreserveWhitespace = true;
                doc.XmlResolver = null;
                doc.Load(sgmlReader); // the timeout happened here
                return doc;
            }
        }
    }
}
The document was malformed: many of the tags were missing their closing tag. Specifically, one of the meta tags was missing its trailing slash. For example, rather than this:
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
it used
<meta name="viewport" content="width=device-width, initial-scale=1.0">
Notice that the last slash is missing, making this an invalid XHTML element. Even though SgmlReader is supposed to "fix" these types of issues, it seems that the DTD was causing the issue.
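To make the difference between the two meta tags concrete, here is a minimal sketch that feeds each form to a strict XML parser. The IsWellFormed helper and the surrounding head element are illustrative assumptions, not part of the original project.

```csharp
using System;
using System.Xml;

public static class MetaTagCheck
{
    // Hypothetical helper: returns true when the fragment parses as well-formed XML.
    public static bool IsWellFormed(string fragment)
    {
        try
        {
            var doc = new XmlDocument();
            doc.LoadXml(fragment);
            return true;
        }
        catch (XmlException)
        {
            // A strict XML parser rejects a start tag that is never closed.
            return false;
        }
    }

    public static void Main()
    {
        string closed = "<head><meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" /></head>";
        string unclosed = "<head><meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"></head>";

        Console.WriteLine(IsWellFormed(closed));   // True
        Console.WriteLine(IsWellFormed(unclosed)); // False
    }
}
```

This is why the page cannot be loaded into an XmlDocument directly and needs something like SgmlReader in front of it.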
The fix was to stop fetching the page through the WebRequest object and instead use the System.Net.WebClient class to download the page as a string, then read it in from there.
Here is the replacement code.
using (WebClient client = new WebClient())
{
    string fipsDoc = client.DownloadString("https://SomeWebsiteWithXhtmlDTDAndMalformedXhtml.com");
    fipsDoc = fipsDoc.Substring(fipsDoc.IndexOf('\n') + 1); // drop the first line (the DOCTYPE); handles both \n and \r\n
    using (StringReader reader = new StringReader(fipsDoc))
    using (SgmlReader sgmlReader = new SgmlReader())
    {
        sgmlReader.DocType = "HTML";
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;
        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = false;
        doc.XmlResolver = null;
        doc.Load(sgmlReader);
        return doc;
    }
}
The fix is on line 4, the Substring call: I simply removed the first line of the downloaded page. That's where the DTD was being declared. I don't know for sure why it was causing a timeout, but I'm guessing it has something to do with the fact that SgmlReader was making assumptions about the content.
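Dropping the first line works only when the DOCTYPE sits alone on that line. As a hedged alternative, the sketch below removes the declaration itself wherever it appears. The StripDoctype helper is hypothetical, the sample DOCTYPE is an assumed example of an XHTML declaration rather than the one from the actual page, and the pattern does not handle a DOCTYPE with an internal subset.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DoctypeStripper
{
    // Matches a DOCTYPE declaration up to its closing '>'.
    // Assumption: no internal subset ("[ ... ]") inside the declaration.
    private static readonly Regex Doctype =
        new Regex(@"<!DOCTYPE[^>]*>", RegexOptions.IgnoreCase);

    // Hypothetical helper: removes the first DOCTYPE declaration, if any.
    public static string StripDoctype(string html)
    {
        return Doctype.Replace(html, string.Empty, 1);
    }

    public static void Main()
    {
        // Assumed sample input; the real page's DOCTYPE may differ.
        string page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" +
            "<html><head></head></html>";

        Console.WriteLine(StripDoctype(page).TrimStart()); // <html><head></head></html>
    }
}
```

The result could be handed to the StringReader in place of the Substring output. Input without a DOCTYPE passes through unchanged.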
I'm a bit unhappy that I had to read it in as a string, manipulate it, then pass it through a StringReader, into the SgmlReader, and finally into the document, only then to be processed by my business logic. However, this solution works for my specific use case, so I'm not going to sweat it.
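One way the intermediate string might be avoided is to discard the first line directly on the reader and then hand that same reader to SgmlReader.InputStream. This is an untested sketch of that idea: the SkipFirstLine helper is hypothetical, and a StringReader stands in for the StreamReader over the response stream so the example runs on its own. It has not been verified against the original site.

```csharp
using System;
using System.IO;

public static class FirstLineSkipper
{
    // Hypothetical helper: consumes and discards the first line, then returns
    // the same reader, now positioned at the start of the second line.
    public static TextReader SkipFirstLine(TextReader reader)
    {
        reader.ReadLine(); // returns null on empty input, which is harmless here
        return reader;
    }

    public static void Main()
    {
        // Stand-in for: new StreamReader(res.GetResponseStream())
        using (TextReader reader = new StringReader("<!DOCTYPE html>\n<html><body>hi</body></html>"))
        {
            Console.WriteLine(SkipFirstLine(reader).ReadToEnd()); // <html><body>hi</body></html>
        }
    }
}
```

ReadLine treats \n, \r, and \r\n as line terminators, so this does not depend on the page's line endings matching Environment.NewLine.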