Lucene/JapaneseAnalyser/Sen: if the dictionary contains a very long word, adding a document that contains that word fails with IndexOutOfBoundsException

Environment: sen 1.2.2.1
An IndexOutOfBoundsException? That is obviously a bug.

Error message

java.lang.RuntimeException: java.lang.IndexOutOfBoundsException
	at net.java.sen.Dictionary.getPosInfo(Dictionary.java:149)
	at net.java.sen.Viterbi.analyze(Viterbi.java:134)
	at net.java.sen.StringTagger.analyze(StringTagger.java:180)
	at net.java.sen.StreamTagger.hasNext(StreamTagger.java:109)
	at org.apache.lucene.analysis.ja.sen.SenTokenizer.next(SenTokenizer.java:45)
	at org.apache.lucene.analysis.ja.POSFilter.next(POSFilter.java:73)
	(snip)
	at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
	at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:107)
	at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:219)
	at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:95)
	at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:1013)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1001)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
	(rest omitted)

Fix

Change the loop inside net.java.sen.Dictionary#getPosInfo from

	while (ffd.read(b, cnt, 1) != -1 && b[cnt] != (byte) '\0')
		cnt++;

to

	while (ffd.read(b, cnt, 1) != -1 && b[cnt] != (byte) '\0') {
		cnt++;
		// Buffer is full: double it before the next read.
		if (b.length <= cnt) {
			byte[] newB = new byte[b.length * 2];
			System.arraycopy(b, 0, newB, 0, b.length);
			b = newB;
		}
	}

so that the buffer grows whenever it fills up.

How I tracked it down

At first glance the exception seems to come from the array handling inside Dictionary#getPosInfo, but that is a trap:

    } catch (Exception e) {
      throw new RuntimeException(e.toString());
    }

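This catch block throws away the original stack trace, because it passes only e.toString() to the new exception (presumably the intent was to convert an IOException into a RuntimeException).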
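A minimal sketch of the difference, not taken from Sen: the class and method names below are made up for illustration. Passing the caught exception as the cause, instead of only its toString(), keeps the original stack trace reachable.

```java
import java.io.IOException;

// Hypothetical demo class; not part of Sen.
class WrapWithCause {
    // Stand-in for a low-level read that fails.
    static void failingRead() throws IOException {
        throw new IOException("simulated failure");
    }

    // Mirrors the pattern quoted above: only the message text survives.
    static RuntimeException wrapLossy() {
        try {
            failingRead();
            return null;
        } catch (Exception e) {
            return new RuntimeException(e.toString());
        }
    }

    // Passing the exception itself as the cause keeps its stack trace.
    static RuntimeException wrapWithCause() {
        try {
            failingRead();
            return null;
        } catch (Exception e) {
            return new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrapLossy().getCause());     // null
        System.out.println(wrapWithCause().getCause()); // java.io.IOException: simulated failure
    }
}
```

With the second form, printStackTrace() on the RuntimeException would also print a "Caused by:" section that points at the real failure site.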
The real origin of the exception is:

 	at java.nio.Buffer.checkBounds(Unknown Source)
 	at java.nio.ByteBuffer.get(Unknown Source)
 	at java.nio.DirectByteBuffer.get(Unknown Source)
 	at net.java.sen.io.MappedBufferedReader.read(MappedBufferedReader.java:73)

So the exception starts there. The arguments passed to java.nio.DirectByteBuffer#get appear to be wrong.
The code in question is:

	//Dictionary#getPosInfo
	int cnt = 0;
	byte b[] = new byte[256];
	ffd.seek(f);
	while (ffd.read(b, cnt, 1) != -1 && b[cnt] != (byte) '\0')
		cnt++;

It looks as though the array bounds are never checked, yet the exception is thrown inside ffd.read().
Comparing the return value of ffd.read against -1 looks correct at first glance, but:

	//net.java.sen.io.MappedBufferedReader
	public int read(byte b[], int start, int length) {
		map.get(b, start, length);
		return length; // !!!
	}

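read always returns length, so the comparison with -1 can never be true. The interface definition carries no comments at all, so what the correct contract is remains a mystery. This is clearly fishy, but it looks like end-of-file detection, so it is probably a separate issue.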
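For comparison, here is a rough sketch of what read could look like if the intended contract were the same as java.io.InputStream#read(byte[], int, int), that is, return the number of bytes read, or -1 at the end of the data. That contract is my assumption, since the real interface documents nothing, and BoundedBufferReader is a made-up stand-in, not Sen's MappedBufferedReader.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a reader over a ByteBuffer; not Sen's actual class.
class BoundedBufferReader {
    private final ByteBuffer map;

    BoundedBufferReader(ByteBuffer map) {
        this.map = map;
    }

    // Assumed contract (borrowed from InputStream.read):
    // returns the number of bytes copied into b, or -1 once the buffer is exhausted.
    int read(byte[] b, int start, int length) {
        if (length == 0) {
            return 0;
        }
        if (!map.hasRemaining()) {
            return -1;
        }
        int n = Math.min(length, map.remaining());
        map.get(b, start, n);
        return n;
    }
}
```

With a read like this, the `!= -1` test in the caller's loop would actually terminate at the end of the mapped file instead of relying on an exception.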
Now, according to the Javadoc, the exceptions DirectByteBuffer#get can throw are:

BufferUnderflowException - If there are fewer than length bytes remaining in this buffer
IndexOutOfBoundsException - If the preconditions on the offset and length parameters do not hold
[http://java.sun.com/j2se/1.5.0/docs/api/java/nio/ByteBuffer.html#get(byte[], int, int)]
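The two cases can be told apart in isolation. The following is a small sketch, using a heap ByteBuffer for simplicity (the contract quoted above is documented on ByteBuffer itself); the class and helper names are made up.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Hypothetical demo class; not part of Sen.
class GetPreconditions {
    // Returns the name of the exception thrown by get(), or "ok" if none.
    static String tryGet(ByteBuffer buf, byte[] dst, int offset, int length) {
        try {
            buf.get(dst, offset, length);
            return "ok";
        } catch (IndexOutOfBoundsException e) {
            return "IndexOutOfBoundsException";
        } catch (BufferUnderflowException e) {
            return "BufferUnderflowException";
        }
    }

    public static void main(String[] args) {
        byte[] dst = new byte[256];

        // Plenty of data left in the buffer, but offset == dst.length:
        // the destination array has no room, so the precondition fails.
        System.out.println(tryGet(ByteBuffer.allocate(1024), dst, 256, 1));
        // IndexOutOfBoundsException

        // Destination array is fine, but the buffer has nothing left to give.
        System.out.println(tryGet(ByteBuffer.allocate(0), dst, 0, 1));
        // BufferUnderflowException
    }
}
```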

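If it were reading past the end of the buffer, that would be the other exception (BufferUnderflowException), right? Hang on, could it be...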

	//net.java.sen.io.MappedBufferedReader
	public int read(byte b[], int start, int length) {
		if(b.length <= start) throw new IllegalArgumentException("you are an idiot: b.length <= start");
		map.get(b, start, length);
		return length;
	}

java.lang.IllegalArgumentException: you are an idiot: b.length <= start

YES!!
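That confirms the cause: once a dictionary entry runs past 256 bytes without a terminating '\0', cnt reaches b.length and the offset passed to get is out of range. So I added buffer reallocation:

	while (ffd.read(b, cnt, 1) != -1 && b[cnt] != (byte) '\0') {
		cnt++;
		// Buffer is full: double it before the next read.
		if (b.length <= cnt) {
			byte[] newB = new byte[b.length * 2];
			System.arraycopy(b, 0, newB, 0, b.length);
			b = newB;
		}
	}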

It no longer crashes ^^
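The same grow-on-demand idea as a self-contained sketch, reading a NUL-terminated byte string straight from a ByteBuffer. This is not Sen's code: NulTerminatedReader and readUntilNul are made-up names, and it assumes Java 6 or later for Arrays.copyOf.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper; not part of Sen.
class NulTerminatedReader {
    // Reads from buf's current position up to (not including) the first
    // NUL byte, or to the end of the buffer if no NUL is found.
    static byte[] readUntilNul(ByteBuffer buf, int initialCapacity) {
        byte[] b = new byte[Math.max(1, initialCapacity)];
        int cnt = 0;
        while (buf.hasRemaining()) {
            byte next = buf.get();
            if (next == 0) {
                break;
            }
            // Array is full: double it before storing the next byte.
            if (cnt == b.length) {
                b = Arrays.copyOf(b, b.length * 2);
            }
            b[cnt++] = next;
        }
        // Trim to the number of bytes actually read.
        return Arrays.copyOf(b, cnt);
    }

    public static void main(String[] args) {
        // 300 bytes of 'a' followed by a NUL: longer than the initial 256.
        byte[] data = new byte[301];
        Arrays.fill(data, 0, 300, (byte) 'a');
        byte[] word = readUntilNul(ByteBuffer.wrap(data), 256);
        System.out.println(word.length); // 300
    }
}
```

The initial size of 256 is then only a starting guess and no longer a hard limit on entry length.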

Summary

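The magic number 256 shows up all over the place, there are hardly any tests, and the official site has been taken over by spam. For such a well-known product, it is in a pretty dangerous state. Scary.
There is a derived project called GoSen. Its listed features: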

Furigana processing support
Source upgraded to Java 5
Improved GPL compatibility through removal of the dependency on commons-logging
Pure Java dictionary compilation with no dependency on Perl
Greatly reduced heap usage during dictionary compilation, allowing compilation with the default Java heap settings
EUC-JISX0213 character set support allowing correct compilation of the Ipadic dictionary
Significantly improved text analysis speed
Support for morphemes within Ipadic with multiple alternative readings
Full Javadoc class documentation
JUnit test suite

I have barely read it, but my impression is that it has been rewritten quite thoroughly. It might be worth trying.