Reverse API using local objects as method containers
Category domino
Over the past couple days I've been striving for an ideal way to deal with a large volume of documents that are never saved in a database. This is probably an infrequent scenario (in 9.5 years, I can't recall a similar situation), but in case any of you deal with this often, I'd like feedback on which approaches you prefer.
Without diving too deep into the inconsequential, some background: the application in question pulls large recordsets from a remote system. The import functions have been encapsulated into a script library, since several imports are handled by separate agents. In one case, the database is simply cleared and repopulated, so the import function just creates a document for each record, then returns all of them as a NotesDocumentCollection. Another import agent, however, is intended to compare against existing documents and process only updates and new creations. So when the import returns the collection, the agent loops through each imported document and either updates an existing document or creates a new one, then purges the imported collection. Runs great on a small collection, but when it's pulling in tens of thousands, there are two problems: while it's processing all of them, the server's indexer is pegged updating views, and once it has purged the stubs, the database is 50% whitespace. This runs hourly, so by the time compact runs each night, the database is sitting at 4% used... and it's 6 GB.
My first idea was to create a new NotesDocumentCache class. It's very similar in its API (addDocument, getFirstDocument, getNextDocument, etc.) and behavior to NotesDocumentCollection, with one notable exception: it doesn't care whether a document has been saved yet. NotesDocumentCollection is essentially just a table of NoteIDs, so if you try to add an in-memory NotesDocument to a collection before it's saved, it throws an error, because a document is assigned a UNID as soon as it's created, but doesn't get a NoteID until it's saved. The secret sauce in this class is in the form of offset List indexes (which Nathan has deemed "medium clever" on a cleverness rating scale of mild, medium, and hot) that allow getNextDocument / getPrevDocument to rapidly return the appropriate handle without the performance loss of getNthDocument. I like the way it turned out, and will probably use it from time to time when I want to cache a group of documents that may or may not have been saved yet. This also addresses a request I've seen from numerous folks to be able to instantiate an empty collection without using db.Search. But in this case it didn't fix my problem because it doesn't scale well... trying to store 116,000 NotesDocument handles in memory at the same time makes Domino angry.
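
To make the idea concrete, here is a rough sketch in Python rather than LotusScript. It is not the actual class: it only illustrates one way a cache keyed by UNID can keep separate next/previous tables, so that stepping forward or back is a single keyed lookup instead of a positional scan. Python dicts stand in for LotusScript Lists, and `FakeDocument`, `DocumentCache`, and the method names are assumptions made for illustration; the real offset-index scheme may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for an unsaved NotesDocument: it has a UNID but no NoteID."""
    unid: str
    items: dict = field(default_factory=dict)


class DocumentCache:
    """Ordered in-memory cache; next/previous are single keyed lookups."""

    def __init__(self):
        self._docs = {}   # UNID -> document handle
        self._next = {}   # UNID -> UNID of the document added after it
        self._prev = {}   # UNID -> UNID of the document added before it
        self._first = None
        self._last = None

    @property
    def count(self):
        return len(self._docs)

    def add_document(self, doc):
        # Saved or unsaved makes no difference: only the UNID is needed.
        if doc.unid in self._docs:
            raise ValueError("document already cached: " + doc.unid)
        self._docs[doc.unid] = doc
        if self._last is None:
            self._first = doc.unid
        else:
            self._next[self._last] = doc.unid
            self._prev[doc.unid] = self._last
        self._last = doc.unid

    def get_first_document(self):
        return self._docs.get(self._first)

    def get_next_document(self, doc):
        # Returns None when doc is the last entry.
        return self._docs.get(self._next.get(doc.unid))

    def get_prev_document(self, doc):
        # Returns None when doc is the first entry.
        return self._docs.get(self._prev.get(doc.unid))
```

Note that a structure like this still holds every handle at once, which is exactly the part that stops scaling with very large imports.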

What I needed was not a way to hold all the documents in memory and then process all of them, but rather a way to process each and then discard it. But that processing would be different for every agent calling the import function. I could subclass the class containing the import function, and have a separate subclass for each agent. But I got to thinking about how JavaScript handles similar behavior: since functions are objects in JavaScript, you can actually pass a function as a parameter to another function, which, after running its own code, can then execute the passed function on a local variable. This is great for iterative processing, because you can define one function for looping through whatever you want to iterate, then pass any function that you want to run on each member. Very dynamic, very powerful. But LotusScript functions aren't objects, so you can't pass them to other functions... who'd have thought I'd ever wish that LotusScript were more like JavaScript?
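
For anyone who hasn't used that style, here is a minimal sketch of function passing. It is written in Python, which treats functions as objects much as JavaScript does; `for_each_record`, `log_name`, and the sample records are made-up names for illustration only.

```python
def for_each_record(records, callback):
    """Loop over the records once, handing each one to the function passed in."""
    count = 0
    for record in records:
        callback(record)  # the caller decides what happens to each member
        count += 1
    return count


seen = []


def log_name(record):
    """One possible per-record function; any other callable would work too."""
    seen.append(record["name"])


total = for_each_record([{"name": "alpha"}, {"name": "beta"}], log_name)
```

The loop is written once, and the per-record behavior is whatever function the caller supplies.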
But... while you can't pass functions to other functions, you can pass objects that contain functions. So I made a minor change to the script library to give it what I consider to be a "reverse API": when the import function is called, it's passed an instance of a DocumentIterator class that is defined within the agent, which has a .processDocument method. The import calls that method on every document it loads, and the method handles the agent-specific processing for the document. In the first example above, that just increments a global import count for logging purposes and saves the document. In the second, it does all the extra stuff. Fast, lightweight, yet flexible.
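
A minimal sketch of that shape, again in Python rather than LotusScript and with plain dicts standing in for documents and for the database. The names `import_records`, `RepopulateProcessor`, `UpdateProcessor`, and the `key` field are hypothetical, and the count lives on the object here instead of in a global.

```python
def import_records(records, processor):
    """Stand-in for the library's import function.

    Builds one document at a time and hands it straight to the caller's
    object, so no collection of documents is ever accumulated.
    """
    for record in records:
        doc = dict(record)  # stand-in for a new, unsaved document
        processor.process_document(doc)


class RepopulateProcessor:
    """First agent: save every document and count it for logging."""

    def __init__(self, store):
        self.store = store  # stand-in for the database
        self.import_count = 0

    def process_document(self, doc):
        self.store[doc["key"]] = doc
        self.import_count += 1


class UpdateProcessor:
    """Second agent: create new documents, update changed ones, skip the rest."""

    def __init__(self, store):
        self.store = store
        self.created = 0
        self.updated = 0
        self.unchanged = 0

    def process_document(self, doc):
        existing = self.store.get(doc["key"])
        if existing is None:
            self.store[doc["key"]] = doc
            self.created += 1
        elif existing != doc:
            existing.update(doc)
            self.updated += 1
        else:
            self.unchanged += 1
```

The import function never needs to know which agent called it; it only relies on the object it receives having a `process_document` method.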
So what do you think? Is that a good model? Or should the importer class contain an empty processDocument method, overridden in a subclass for each agent?
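
For comparison, the subclassing alternative could look roughly like this. It is a Python sketch under the same assumptions as before; `Importer`, `SavingImporter`, and `run` are invented names, not part of the actual library.

```python
class Importer:
    """Base importer: owns the loop and exposes an empty hook."""

    def __init__(self, records):
        self.records = records

    def process_document(self, doc):
        """Deliberately empty; each agent's subclass overrides it."""
        pass

    def run(self):
        count = 0
        for record in self.records:
            self.process_document(dict(record))  # dict stands in for a document
            count += 1
        return count


class SavingImporter(Importer):
    """One agent's subclass: keeps every document it is handed."""

    def __init__(self, records):
        super().__init__(records)
        self.saved = []

    def process_document(self, doc):
        self.saved.append(doc)
```

The visible difference is where the agent-specific code lives: here each agent extends the importer itself, whereas with the passed-in object the importer stays untouched and only sees a method to call.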
In case you might find the NotesDocumentCache class useful, here it is.

Comments
We originally had all queries running as a single agent, but for performance reasons split it out into several. This is because the data returned by one is rarely updated, so they want it to run nightly... the other data set changes rather frequently, so they want that agent to run every hour.
Posted by Tim Tripcony At 10:14:33 AM On 08/12/2007
If you are importing n records and processing them into n saved/unsaved documents and then processing against an existing set of documents, you've just jumped from O(n) to O(2n). Why?
Rather, process in step with your original data set. This may work for your "new import" case as well, if analysis shows there is usually overlap.
What processing do you do on the collection that's returned? Because that's just another O(n) that could potentially be folded into the import.
I take it the existence of Documents means that the user is comparing data sets... Have a look at the requirements, do you really need to store this data set, or just summary data? Do you even have to do two or more queries, could you not roll into one?
Yes, patterns and OO techniques can be helpful here, but only once you've addressed the algorithmic issue.
Posted by Colin Macdonald At 08:08:15 AM On 08/02/2007
Do you know design patterns?
{ Link }
In this context, I suggest you have a look at these three behavioral patterns:
Command pattern: Command objects encapsulate an action and its parameters
{ Link }
Strategy pattern: Algorithms can be selected on the fly
{ Link }
Visitor pattern: A way to separate an algorithm from an object
{ Link }
We can all learn much from those design patterns, since they solve a lot of our problems in a "best practice" way.
Another question: Why do you create (heavy-weight) NotesDocument objects instead of using a custom class?
You could use a List of those objects with the lookup key as the key for the list.
Just my 2 cents
Thomas
{ Link }
Posted by Thomas Bahn At 05:29:46 AM On 08/01/2007