[Cryptography] filtering html

Sun Oct 15 18:30:33 EDT 2017

> On Oct 15, 2017, at 1:12 AM, James A. Donald <jamesd at echeque.com> wrote:
> 
> An arbitrary and possibly hostile web page passes through proxy or a server, which makes a record of it.
> 
> Is there any easy way to filter that web page, stripping out javascript and links to outside images and such, so that record is guaranteed to display the same way, or closely equivalent way, as the original?
> 
> Seems to me this is a job for an html compiler, that you need to parse it, filter the parse tree, and then regenerate the vanilla html document from the parse tree.  Which sounds like a great deal of work.

I'm not quite sure I understand what problem you're trying to solve, so if I say something off base or even obvious, forgive me.

Let's rewind back to some basics. HTML is a markup language, a way to express both content and display metadata. There are lots of other markup languages, like Runoff and all its derivatives (nroff, troff, etc.), Scribe and all its derivatives (Scribble, Texinfo, etc.), TeX and its relations (LaTeX, ConTeXt, XeTeX, etc.). One can even arguably throw in PostScript and its derivatives like PDF, as well as RTF, but relatively few people write them directly. There are of course newer markup languages like Markdown and all of its dialects and derivatives.

Many of these are compiled as opposed to interpreted, and I suppose it's hard to even say what the difference is, because many of the languages are not anything like Turing complete. Even the ones that are full programming languages (TeX, PostScript) are typically run statically; one can make a TeX document/program that generates a different output document every time it's run, but in general we don't do that. I have a friend who wrote a PostScript program to generate her business cards, and every card had a uniquely generated fractal as a logo; this is exceptional enough to call out.

Getting back to HTML, it's a derivative of SGML, which was (is?) used for documents back in the day by lots of organizations. I worked for a company that used SGML extensively, and once HTML came out, I basically wrote HTML as if it were SGML with features removed. XML is also an SGML derivative, as is EPUB, since EPUB is built on XML.

HTML, as we all know, has links and external references. These external references are evaluated at display time, and it's not unusual for a re-display of a page to generate slightly different content. But yeah, you can easily separate the rendering from the display: render the HTML in a sandbox and then display a safer format, such as an image. The Amazon Silk browser does (did?) this, and as I remember, Opera did something similar.

HTML can also have embedded scripts for active content, and this, of course, brings in difficulties. But yeah, you can filter out some or all of the scripts, as well as other content. Ad blockers and content filters do this, and they work reasonably well. You could even provide some sort of active separation in a proxy. It's not always an easy task, but conceptually it's no different than many other things like VNC, X Windows, etc.
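
As a rough illustration of the parse, filter, and regenerate idea, here is a minimal sketch in Python using only the standard library's html.parser. Everything in it is an assumption made for illustration: the names (Sanitizer, sanitize, url_ok) and the policy sets are hypothetical, the lists are not exhaustive, and it favors safety over fidelity by dropping all CSS rather than parsing it. It is not hardened against hostile input and should not be taken as a vetted sanitizer.

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative policy sets (assumptions, not a vetted or complete list).
DROP_WITH_CONTENT = {"script", "style", "iframe", "object"}
DROP_TAG_ONLY = {"embed", "link", "base"}
VOID = {"area", "br", "col", "hr", "img", "input", "source", "track", "wbr"}
LOCAL_ONLY_ATTRS = {"src", "poster", "background"}
NAME_RE = re.compile(r"[a-z][a-z0-9-]*")


def url_ok(value, allow_remote):
    """Accept relative URLs; accept http(s) only when allow_remote is set."""
    try:
        parts = urlparse((value or "").strip())
    except ValueError:
        return False
    if not parts.scheme and not parts.netloc:
        return True
    return allow_remote and parts.scheme in ("http", "https")


class Sanitizer(HTMLParser):
    """Re-emits a filtered copy of the document as it is parsed."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in DROP_TAG_ONLY or not NAME_RE.fullmatch(tag):
            return
        kept = [tag]
        for name, value in attrs:
            if not NAME_RE.fullmatch(name):
                continue  # odd attribute names are not re-emitted
            if name.startswith("on") or name in ("style", "srcset"):
                continue  # event handlers, inline CSS, multi-URL lists
            if name in LOCAL_ONLY_ATTRS and not url_ok(value, allow_remote=False):
                continue  # no fetching of outside images and media
            if name == "href" and not url_ok(value, allow_remote=True):
                continue  # drops javascript:, data:, and similar schemes
            if value is None:
                kept.append(name)
            else:
                kept.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(kept) + ">")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag in DROP_TAG_ONLY or tag in VOID:
            return
        if NAME_RE.fullmatch(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    """Parse html_text and return the regenerated, filtered markup."""
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi<script>alert(1)</script></p>'))
```

Because the output is regenerated from parser events rather than edited in place, anything the filter does not explicitly re-emit (comments, doctype, unknown constructs) simply disappears. That is the safe default, but it is also where the "displays the same way" goal gets lost, which is why real filters end up being a good deal more work than this.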

For that matter, many browsers are moving to an architecture that's not unlike that now, even. Each browser tab contains a completely different rendering and execution system. Yes, yes, it's not got the same isolation as running it on different hardware, but conceptually the isolation of execution and rendering goes to similar principles.

So yes, you can make an HTML compiler. Heck, pandoc (http://pandoc.org) is a compiler / translator between many formats and you could (e.g.) compile HTML into PDF and then just display that. Heck, you could compile it into a PNG, for that matter. You could also print a web page and read it on paper, too.
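
For the pandoc route, a sketch of what the call might look like from Python. It assumes pandoc is installed and on the PATH, and that a PDF engine pandoc can use is available; the function names and file names here are hypothetical. The --from and --output options are pandoc's own.

```python
import shutil
import subprocess


def pandoc_command(html_path, pdf_path):
    """Build the argv for converting a saved HTML file to PDF with pandoc."""
    return ["pandoc", "--from", "html", "--output", pdf_path, html_path]


def html_to_pdf(html_path, pdf_path):
    """Run pandoc on a saved page; raises if pandoc is missing or fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(html_path, pdf_path), check=True)


if __name__ == "__main__":
    # Hypothetical file names; nothing is executed here, only printed.
    print(" ".join(pandoc_command("page.html", "page.pdf")))
```

pandoc does not execute scripts, so whatever the page builds at run time will not appear in the result.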

Obviously, this ignores the active content. For HTML the way it is used today, this is an issue. I recently made a comment like, "It's impossible to use the web with JavaScript turned off, these days" and while perhaps an exaggeration (okay, it's not *impossible*) it doesn't go all the way to hyperbole. There are a lot of ad blockers now that do at least some JavaScript filtering.

Bottom line: sure, you can do what you're saying, but it's easy to take compilation to the point where you're no longer using a web browser. If the goal you're trying for is Safe Browsing, lots of people want that. I know I do, and it's a hard problem that lots of people are working on for some value of "safe."

	Jon