Class Docx4jDriver

java.lang.Object
com.topologi.diffx.Docx4jDriver

public class Docx4jDriver extends Object
docx4j uses Topologi's diffx project to determine the difference between two fragments of WordML. (An XSLT is then used to convert the diffx output to WordML with the changes tracked.)

If the two things being compared start or end with the same XML, diffx slices that off. After that, you are left with EventSequences representing the two things being compared (an event for the start and end of each element, for attributes, and for each word of text).

The problem is that performance drops off rapidly as the event sequences get longer. For example, if each event sequence is:
+ under, say, 500 entries, time is negligible
+ 1800 entries long, calculating the LCS length to fill the matrix may take 17 seconds (on a 2.4 GHz Core 2 Duo, running from within Eclipse)
+ 3000 entries, about 95 seconds (under IKVM)
+ 3500 entries, about 120 seconds
+ 5500 entries, about 550 seconds (under IKVM)

Ultimately, we should migrate to / develop a library which doesn't have this problem, and supports:
- word level diff (diffx does; Fuego doesn't, but could)
- 3 way merge
- move (though why? Can OpenXML represent a move?)

An intermediate step might be to add an implementation of the Lindholm heuristically guided greedy matcher to the com.topologi.diffx.algorithm package. See the Fuego Core XML Diff and Patch tool project (which, as at 19 June 2009, was offline). This could be relatively straightforward, since it also uses an event sequence concept.

But in the meantime, this class attempts to divide up the problem. The strategy is to look at the children of the nodes passed in, hoping to find an LCS (longest common subsequence) amongst those. If we have that LCS, then (at least in the default case) we don't need to diff the things in the LCS, just the things between the LCS entries. I say 'default case' because in that case the LCS entries are each the hashcode of the diffx EventSequences. (But if you were operating on sdts, you might make them the sdt id.)
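The divide-up strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of this class: `CoarseDiffSketch`, `lcsPairs` and `gaps` are hypothetical names, each child is represented by a single integer key (standing in for the hashcode of its event sequence, or an sdt id), and a plain dynamic-programming LCS is swapped in for whatever coarse-grained comparison is actually used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of docx4j or diffx. */
final class CoarseDiffSketch {

    /**
     * Returns index pairs {i, j} of one LCS of the two key lists,
     * i.e. left.get(i) equals right.get(j). Classic O(n*m) table.
     */
    static List<int[]> lcsPairs(List<Integer> left, List<Integer> right) {
        int n = left.size(), m = right.size();
        // len[i][j] = LCS length of left[i..] and right[j..]
        int[][] len = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                len[i][j] = left.get(i).equals(right.get(j))
                        ? len[i + 1][j + 1] + 1
                        : Math.max(len[i + 1][j], len[i][j + 1]);
            }
        }
        // Walk the table from the start, collecting matched positions.
        List<int[]> pairs = new ArrayList<>();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (left.get(i).equals(right.get(j))) {
                pairs.add(new int[] {i, j});
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return pairs;
    }

    /**
     * Returns the ranges between LCS entries as
     * {leftFrom, leftTo, rightFrom, rightTo} (end-exclusive).
     * Only these ranges would be handed to the fine-grained diff.
     */
    static List<int[]> gaps(List<Integer> left, List<Integer> right) {
        List<int[]> result = new ArrayList<>();
        int li = 0, ri = 0;
        for (int[] p : lcsPairs(left, right)) {
            if (p[0] > li || p[1] > ri) {
                result.add(new int[] {li, p[0], ri, p[1]});
            }
            li = p[0] + 1;
            ri = p[1] + 1;
        }
        if (li < left.size() || ri < right.size()) {
            result.add(new int[] {li, left.size(), ri, right.size()});
        }
        return result;
    }
}
```

Calling `CoarseDiffSketch.gaps(leftKeys, rightKeys)` yields the child ranges that still differ; children whose keys fall in the LCS are skipped. Note that when the key is a hashcode, equal keys do not strictly guarantee equal content, so a real implementation may want to confirm a match before skipping it.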
This approach might work on the children of w:body (paragraphs, for example), or the children of an sdt:content. It could also help if you run it on two w:body elements where all the w:p are inside w:sdts, provided you make use of the sdt ids *and* the sliced event sequences inside the sdts aren't too long.

We use the eclipse.compare package for the coarse-grained divide and conquer.

TODO: If any of the diffx sliced event sequence pairs are each > 2000 entries long, this will log a warning, and just return left tree deleted, right tree inserted. Or try to carve them up somehow?

The classes in src/diffx do not import any of org.docx4j proper; keep it this way so that this package can be made into a DLL using IKVM, and used in a .NET application, without extra dependencies (though we do use commons-lang, for help in creating good hashcodes).
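The size guard mentioned in the TODO above might look something like the following. It is a minimal sketch under assumptions: `SizeGuardSketch` and `diffWithGuard` are hypothetical names, events are stood in for by plain strings, the `-`/`+` prefixes are an invented way of marking deleted and inserted entries, and the fine-grained diff is passed in as a function rather than calling diffx directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.logging.Logger;

/** Hypothetical sketch; not part of docx4j or diffx. */
final class SizeGuardSketch {

    /** Threshold taken from the TODO note: both sequences longer than this. */
    static final int MAX_ENTRIES = 2000;

    private static final Logger LOG =
            Logger.getLogger(SizeGuardSketch.class.getName());

    /**
     * Runs fineDiff unless both sequences exceed MAX_ENTRIES, in which case
     * it logs a warning and reports everything on the left as deleted ("-")
     * and everything on the right as inserted ("+").
     */
    static List<String> diffWithGuard(
            List<String> left,
            List<String> right,
            BiFunction<List<String>, List<String>, List<String>> fineDiff) {
        if (left.size() > MAX_ENTRIES && right.size() > MAX_ENTRIES) {
            LOG.warning("Event sequences too long (" + left.size() + ", "
                    + right.size() + "); skipping fine-grained diff");
            List<String> out = new ArrayList<>(left.size() + right.size());
            for (String e : left) {
                out.add("-" + e);
            }
            for (String e : right) {
                out.add("+" + e);
            }
            return out;
        }
        return fineDiff.apply(left, right);
    }
}
```

The fallback loses all detail inside the oversized pair, which is why the note also asks whether such pairs could be carved up further instead.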