[Mayan EDMS: 1672] Indexing speed improvements

MacRobb Simpson
I'm currently in the process of implementing Mayan to replace our current Document Management System (FileBound).
Our setup currently consists of 14 document types, 64 metadata types, 16 indexes and over 66,000 files loaded.

Reindexing this system is... somewhat slow to say the least.
I let it crunch away for a good 16 hours, and got about halfway through.

Obviously, this isn't good enough - indexing might be slow, but it shouldn't be /this/ slow.

With a few mods, I've sped this up by at least 8x (figure around 4 hours for a full rebuild... acceptable).
What I did was:
1. Instead of indexing by document, then index, I'm indexing by index, then document. This allows a single index to be rebuilt at a time, vs. multiple being 'filled in' at once.
2. Modified the delete section to delete only the current index as it's being worked on. This allows you to keep using the other indexes during the rebuild process.
3. Removed the 'with transaction.atomic():' line in the indexer. I'm sure this makes it 'less safe' if something were to fail, but I figure that if something fails a reindex is needed anyway. (By splitting the index rebuild from the single-file indexer, I can leave that atomic transaction line in place for a single file, where it makes sense.) This change easily doubled the speed, if not quadrupled it.
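
The reordering in steps 1 and 2 can be sketched without Django. This is only an illustration under stated assumptions, not Mayan's code: the function name rebuild_index_first, the dict shapes and the sample index names are all hypothetical stand-ins for the Index and Document models.

```python
def rebuild_index_first(indexes, documents, store):
    """Rebuild one index at a time, yielding each index name once it is complete.

    indexes:   {index name: set of document type names}   (hypothetical shape)
    documents: list of {"label": ..., "type": ...} dicts  (hypothetical shape)
    store:     {index name: list of document labels}, updated in place
    """
    for index_name, doc_types in indexes.items():
        # Clear only the index being rebuilt; the others keep their old entries.
        store[index_name] = []
        for document in documents:
            if document["type"] in doc_types:
                store[index_name].append(document["label"])
        yield index_name


indexes = {"by_year": {"invoice", "receipt"}, "by_vendor": {"invoice"}}
documents = [
    {"label": "inv-001", "type": "invoice"},
    {"label": "rcpt-001", "type": "receipt"},
]
store = {"by_year": ["stale"], "by_vendor": ["stale"]}

rebuild = rebuild_index_first(indexes, documents, store)
next(rebuild)  # rebuild the first index only
print(store)
```

After the single next() call the first index holds fresh entries while the second still holds its old ones, which is the property that keeps the other indexes usable during a rebuild.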

My final code:
mayan/apps/document_indexing/managers.py:
    def rebuild_all_indexes(self):
        from .models import Index
        # Document is assumed to be imported at the top of managers.py already.

        for index in Index.objects.filter(enabled=True):
            print 'indexing', index
            # Delete the nodes applicable to this index only
            print 'deleting nodes'
            for instance_node in self.filter(id=index.id):
                instance_node.delete()
            # Delete empty nodes
            self.delete_empty_index_nodes()
            print 'adding index node'
            # Add the root node for this index
            root_instance, created = self.get_or_create(
                index_template_node=index.template_root, parent=None
            )
            print 'indexing documents...'
            docs_indexed = 0
            # Reindex each document whose type is linked to this index
            for document in Document.objects.filter(document_type__in=index.document_types.all()):
                # Evaluate each top-level template node for the document
                for template_node in index.template_root.get_children():
                    self.cascade_eval(document, template_node, root_instance)
                docs_indexed += 1
                if docs_indexed % 10 == 0:
                    print 'indexing document', document, docs_indexed, 'completed'
All of the 'print' lines could be removed, but are very handy when watching it run from run-server/devel mode.
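
As an alternative to print statements, the same progress reporting could go through the standard logging module, so it can be silenced without editing the code. This is a minimal sketch, not part of the patch above: the helper log_progress and the logger name are hypothetical.

```python
import logging

# Hypothetical logger name, chosen for illustration only.
logger = logging.getLogger("document_indexing.rebuild")


def log_progress(items, every=10):
    """Yield items unchanged, logging one line after every `every` items."""
    count = 0
    for item in items:
        yield item
        count += 1
        if count % every == 0:
            logger.info("indexed %d documents (last: %s)", count, item)


logging.basicConfig(level=logging.INFO)

# Wrap any iterable of documents; here plain integers stand in for documents.
processed = list(log_progress(range(25), every=10))
```

Raising the logger level to WARNING turns the progress lines off while leaving the loop untouched.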


Anyone got any other improvement ideas or potential pitfalls that this could cause?

--

---
You received this message because you are subscribed to the Google Groups "Mayan EDMS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

[Mayan EDMS: 1742] Re: Indexing speed improvements

rosarior
Administrator
That's great! Going through your changes to see how much I can move upstream.

In reply to MacRobb Simpson's post of Friday, April 28, 2017 at 5:26:03 PM UTC-4.

[Mayan EDMS: 1743] Re: Indexing speed improvements

rosarior
Administrator
Doing some tests, I've hit several regressions and a few race conditions (without the 'document_indexing_task_do_rebuild_all_indexes' lock, deleting a document would delete its index instance if it is empty, even while an index is being rebuilt).
The entire indexing locking workflow will need to be remade too. This refactor is bigger than initially expected.
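
The race described above can be illustrated with a plain threading.Lock. This is a sketch under assumptions, not Mayan's lock manager: rebuild_lock, prune_if_empty and the node dict are hypothetical stand-ins, and the lock here only mimics the role of the rebuild lock named in the post.

```python
import threading

# Stand-in for the rebuild lock; Mayan's own locking code is not used here.
rebuild_lock = threading.Lock()


def prune_if_empty(nodes, name):
    """Delete an empty index node, unless a rebuild currently holds the lock.

    Returns True if pruning was allowed to run, False if it was skipped.
    """
    if not rebuild_lock.acquire(False):  # non-blocking attempt
        return False  # a rebuild is in progress: leave the node in place
    try:
        if not nodes.get(name):
            nodes.pop(name, None)
        return True
    finally:
        rebuild_lock.release()


nodes = {"2017": []}  # an index node that is momentarily empty

with rebuild_lock:  # simulate a rebuild in progress
    skipped = prune_if_empty(nodes, "2017")  # pruning is refused
    nodes["2017"].append("inv-001")  # the rebuild repopulates the node

allowed = prune_if_empty(nodes, "2017")  # runs, but the node is no longer empty
```

Without the lock check, the first prune_if_empty call would remove the node while the rebuild was still about to fill it.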

In reply to Roberto Rosario's post of Saturday, May 27, 2017 at 11:01:56 AM UTC-4.

[Mayan EDMS: 1744] Re: Indexing speed improvements

rosarior
Administrator
I'm rewriting most of the indexing code and managed to include reindexing for individual indexes rather than all at once. Commit here: https://gitlab.com/mayan-edms/mayan-edms/commit/ac6f748113932d91f23f15dffd9a2ba95b2a1b66
The rewrite allows the use of fewer locks (just 2 now), so it is already much faster. It also opens the possibility of indexing by workflow states and tags. The code is in a separate branch of the master branch (2.2) to try to push this into the next stable release (2.2.1 or 2.3) instead of waiting for the next major version (3.0). If you have a development install of Mayan, please help test this branch to speed up its inclusion.

In reply to Roberto Rosario's post of Saturday, May 27, 2017 at 2:07:31 PM UTC-4.