20040723

LocalNames Ripper Using HTMLParser

LocalNames is a project born out of the wiki world to ease the management of links. The principle is that a LocalNames-enabled application, such as a wiki or blog, automatically inserts links to resources based on key phrases or words in the text.
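
As a rough illustration of that principle, here is a minimal sketch of how an application might turn known names into links. It is not taken from any LocalNames implementation: the class name LocalNamesLinker, the linkify method, the example.org URLs and the choice of HTML anchors as output are all assumptions made for the example, and it assumes a modern JDK (8 or later).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any LocalNames library.
public class LocalNamesLinker {

    /** Wraps every whole-word occurrence of a known name in an HTML anchor. */
    static String linkify(String text, Map<String, String> names) {
        if (names.isEmpty()) {
            return text;
        }
        // Try longer names first so "Java Servlet" wins over "Java".
        List<String> keys = new ArrayList<>(names.keySet());
        keys.sort((a, b) -> b.length() - a.length());

        // Build one alternation so the text is scanned in a single pass
        // and inserted anchors are never matched again.
        StringBuilder alternation = new StringBuilder();
        for (String key : keys) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(key));
        }
        Matcher m = Pattern.compile("\\b(?:" + alternation + ")\\b").matcher(text);

        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group();
            String anchor = "<a href=\"" + names.get(name) + "\">" + name + "</a>";
            m.appendReplacement(out, Matcher.quoteReplacement(anchor));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Placeholder names table; the URLs are illustrative only.
        Map<String, String> names = new LinkedHashMap<>();
        names.put("HTMLParser", "http://example.org/htmlparser");
        names.put("J2EE", "http://example.org/j2ee");
        System.out.println(linkify("I parse pages with HTMLParser on J2EE.", names));
    }
}
```

The word-boundary match only behaves well for names that start and end with a letter or digit; a real implementation would need its own rules for punctuation and case.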

I have produced a draft set of LocalNames for Java & J2EE here, which the Java/Wiki community might find useful.

To produce some of these lists I used a ripper application written in Java. It uses the HTMLParser library to extract the links from a page and writes them to a text file in the simple LocalNames format.

Here is the very simple program to do this:


package com.hughreid.localnames.ripper;

import java.io.FileWriter;
import java.io.IOException;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.ObjectFindingVisitor;

public class Ripper {
    /*
     * Takes one argument: the URL of the page to parse.
     */
    public static void main (String[] args) throws ParserException, IOException {
        if (args.length != 1) {
            System.err.println("Usage: java com.hughreid.localnames.ripper.Ripper <url>");
            return;
        }
        // Collect every link tag found on the page.
        Parser parser = new Parser (args[0]);
        ObjectFindingVisitor visitor = new ObjectFindingVisitor (LinkTag.class);
        parser.visitAllNodesWith (visitor);
        Node[] links = visitor.getTags ();
        FileWriter writer = new FileWriter("out.txt");
        try {
            // Write the LocalNames header, then one "name" URL pair per link.
            writer.write(" http://purl.net/net/localnames/\n");
            writer.write(" NamesTable\n");
            for (int i = 0; i < links.length; i++) {
                LinkTag linkTag = (LinkTag)links[i];
                writer.write(" \"" + linkTag.getLinkText () + "\" ");
                writer.write(linkTag.getLink() + "\n");
            }
        } finally {
            // Close the writer so buffered output is actually flushed to out.txt.
            writer.close();
        }
    }
}



For the benefit of those unfamiliar with Java:
To use it, download HTMLParser and a Java SDK (e.g. 1.4.2). Add htmlparser.jar and htmllexer.jar to your CLASSPATH environment variable, keeping the current directory (.) on it as well, and compile the program with javac com/hughreid/localnames/ripper/Ripper.java. Then run java com.hughreid.localnames.ripper.Ripper http://www.awebsite.com/listoflinks.html and the program will create a file out.txt in the current working directory.

1 comment:

Anonymous said...

I'm likely going to simplify the LocalNames format. Code changes? Basically, for the code you just wrote, just take out the PURL line and the NamesTable line, and take out the spaces indenting the blocks.

Yay!

And now it's dramatically easier to write a Java names description interpreter, if that's what you like.

See the new format description for details. I welcome comments and questions. It's still flexible, and you certainly have influence over my decisions.
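
Going only by the description in this comment, the output step for the simplified format might look like the sketch below: no PURL line, no NamesTable line, and no leading spaces, just one quoted name and its URL per line. The class name SimplifiedNamesFormat, the format method and the sample entries are hypothetical, the code assumes a modern JDK rather than 1.4.2, and the new format description remains the authority on what is actually valid.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the simplified output described in the comment.
public class SimplifiedNamesFormat {

    /** Renders each entry as: "link text" url, one per line, with no header or indentation. */
    static String format(Map<String, String> links) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> link : links.entrySet()) {
            sb.append('"').append(link.getKey()).append("\" ")
              .append(link.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder entries; in the ripper these would come from the LinkTag objects.
        Map<String, String> links = new LinkedHashMap<>();
        links.put("Servlets", "http://example.org/servlets");
        links.put("JSP", "http://example.org/jsp");
        System.out.print(format(links));
    }
}
```

In the Ripper program itself, the equivalent change would be to drop the two header writes and the leading space in the string written before each link's text.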