HTML-EXTRACT - Extract plain text from HTML

This was written in 2002/2003 for a customer who wanted a simple Unix-like "filter" to extract the textual content from HTML files. It was implemented in CLISP because CLISP is available on many different platforms, but it contains virtually no implementation-dependent code and should thus be portable to other Common Lisp implementations without much effort.

HTML-EXTRACT can be downloaded from http://weitz.de/files/html-extract.tar.gz. It comes with a BSD-style license so you can basically do with it whatever you want.

To build HTML-EXTRACT on a Unix-like OS, run the shell script build.sh in the HTML-EXTRACT directory; you might have to adjust the CLISP variable in that script first. This produces a small executable, html-extract, that you can put into, say, /usr/local/bin and use like this:

html-extract <input.html >output.txt

Here, input.html is an arbitrary HTML file, and output.txt will contain the result of stripping all HTML tags from it.
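
Because html-extract reads HTML on standard input and writes text to standard output, it fits into ordinary shell loops. The following is only a sketch of one way to convert a whole directory; the helper name extract_all and the HTML_EXTRACT variable are made up for this example and are not part of the distribution.

```shell
#!/bin/sh
# Command used to run the extractor. HTML_EXTRACT is a hypothetical
# variable for this sketch; override it if the binary lives elsewhere.
HTML_EXTRACT="${HTML_EXTRACT:-html-extract}"

# extract_all (hypothetical helper): convert every *.html file in the
# given directory into a .txt file with the same base name.
extract_all() {
    dir="$1"
    for in_file in "$dir"/*.html; do
        # Skip the unexpanded pattern when the directory has no .html files.
        [ -e "$in_file" ] || continue
        out_file="${in_file%.html}.txt"
        # Same invocation as shown above: HTML on stdin, plain text on stdout.
        "$HTML_EXTRACT" <"$in_file" >"$out_file"
    done
}
```

After sourcing this file, calling extract_all ./pages would write ./pages/foo.txt next to each ./pages/foo.html, assuming html-extract is on your PATH.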

$Header: /usr/local/cvsrep/html-extract/doc/index.html,v 1.1.1.1 2005/09/22 22:09:22 edi Exp $
