[Mayan EDMS: 1599] Search for document content not OCRed within Mayan

[Mayan EDMS: 1599] Search for document content not OCRed within Mayan

andre
Hi,

I just discovered Mayan, installed the most recent version (on bare metal, not docker) and really like it. I have been using Alfresco for my personal and home office documents so far, and it seems Mayan could replace it easily for my needs.

But there's one thing I'm not sure I got right: Will I only find content of documents which have been OCRed in Mayan? Over the years I scanned some thousand docs, some of them with manually adjusted OCR recognition (complex tables and stuff). All of my PDFs have been OCRed over the years, and I guess it would take Mayan / tesseract weeks to redo this work - I wouldn't want that.

How can I activate search for these existing contents? Also, what about (for example) Word or Powerpoint documents, is there a way to search within them?

thank you!

--

---
You received this message because you are subscribed to the Google Groups "Mayan EDMS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

[Mayan EDMS: 1600] Re: Search for document content not OCRed within Mayan

andre
I am a little surprised not to see an answer here after a few days - don't get me wrong, I am really keen on using Mayan in the future.

I am doing the following: Import a searchable PDF without having an OCR performed during the import - because it already has the text.
Use the search function to search for content which I know is in the PDF.
Nothing is found.

Is this by design, so Mayan can only search within the text that was recognized within Mayan and existing text isn't added to the search index?
Or am I doing something wrong?

Thanks a lot!



[Mayan EDMS: 1602] Re: Search for document content not OCRed within Mayan

Manuel Reiter
Hi andre,

I'll admit I've also been a bit underwhelmed by the activity in this group. Don't get me wrong anybody, I know everybody here is a volunteer and the answers I *did* receive were friendly and helpful - it's just that I've seen communities around open source software that were a bit more lively. Makes one wonder a bit how alive Mayan actually is as an open source project. Maybe we just picked a bad time to discover it, Easter might have kept a couple of people busy.

As for your question, I'm afraid I can't help you - I'd be interested in the answer myself though, so I hope someone still gives an answer.


[Mayan EDMS: 1603] Re: Search for document content not OCRed within Mayan

MacRobb Simpson
In reply to this post by andre
Here's something that /may/ help:

In Mayan, the OCR text is located in the `ocr_documentpagecontent` table.
It's per page (unfortunate, but if you don't care, you might be able to just shove all your OCR'd text into Page 1 of each document).

Here's a SQL query to start with:
SELECT d.label,p.page_number,p.id FROM `documents_document` as d
inner join `documents_documentversion` as v on d.id=v.document_id
inner join `documents_documentpage` as p on p.document_version_id=v.id
WHERE 1 limit 100

This will get you a list of document labels (you might want the ID or other stuff), page numbers and unique page IDs. The unique page IDs are what you need to create rows in the `ocr_documentpagecontent` table.
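A minimal sketch of that idea, run against a throwaway in-memory SQLite database rather than a real Mayan installation. The table names come from the query above; the columns of `ocr_documentpagecontent` (`document_page_id`, `content`) and the helper names are assumptions for illustration and should be checked against the actual schema. It is also not confirmed in this thread whether rows written this way are picked up by the search, so back up the database before trying anything similar.

```python
import sqlite3

# Toy schema. Table names are taken from the query in the post; the columns
# of ocr_documentpagecontent (document_page_id, content) are ASSUMPTIONS.
SCHEMA = """
CREATE TABLE documents_document (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE documents_documentversion (id INTEGER PRIMARY KEY, document_id INTEGER);
CREATE TABLE documents_documentpage (
    id INTEGER PRIMARY KEY, document_version_id INTEGER, page_number INTEGER);
CREATE TABLE ocr_documentpagecontent (
    id INTEGER PRIMARY KEY, document_page_id INTEGER UNIQUE, content TEXT);
"""

# Same joins as the query above, with explicit ordering.
PAGE_QUERY = """
SELECT d.label, p.page_number, p.id
FROM documents_document AS d
INNER JOIN documents_documentversion AS v ON d.id = v.document_id
INNER JOIN documents_documentpage AS p ON p.document_version_id = v.id
ORDER BY d.label, p.page_number
LIMIT 100
"""


def build_demo_db():
    """Create an in-memory database holding one document with two pages."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO documents_document VALUES (1, 'scan-a.pdf')")
    conn.execute("INSERT INTO documents_documentversion VALUES (10, 1)")
    conn.executemany(
        "INSERT INTO documents_documentpage VALUES (?, ?, ?)",
        [(100, 10, 1), (101, 10, 2)],
    )
    return conn


def store_page_text(conn, page_id, text):
    """Insert (or replace) the text row for one unique page ID."""
    conn.execute(
        "INSERT OR REPLACE INTO ocr_documentpagecontent "
        "(document_page_id, content) VALUES (?, ?)",
        (page_id, text),
    )


conn = build_demo_db()
pages = conn.execute(PAGE_QUERY).fetchall()
for label, page_number, page_id in pages:
    # In practice the text would come from your existing OCR results.
    store_page_text(conn, page_id, f"text of {label} page {page_number}")

stored = conn.execute(
    "SELECT document_page_id, content FROM ocr_documentpagecontent "
    "ORDER BY document_page_id"
).fetchall()
```

The `UNIQUE` constraint on the assumed `document_page_id` column is what lets `INSERT OR REPLACE` overwrite an existing row for a page instead of creating a second one.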

It may not be a perfect solution, but you can definitely rig up some stuff to get what you need, supported or not!

[Mayan EDMS: 1625] Re: Search for document content not OCRed within Mayan

rosarior
Administrator
The OCR app will always try to parse the text of previously OCRed PDFs, office documents and text files before attempting the OCR step (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/classes.py#L32).

Several parsers can be registered and will be tried in sequence. A Poppler and a PDFMiner parser are included by default (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/parsers.py#L201). The PDFMiner parser could be removed if a viable, drop-in replacement that supports Python 3.x is not found by the next release.
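The "parse first, OCR only as a fallback" behaviour can be pictured roughly as below. This is an illustrative sketch and not Mayan's code: the function names, the dictionary standing in for a page, and the `ParserError` class defined here are all hypothetical.

```python
class ParserError(Exception):
    """Raised by a parser that cannot extract text (hypothetical class)."""


def extract_page_text(page, parsers, ocr):
    """Try each parser in order; fall back to OCR if none yields text.

    Returns a (source, text) tuple so the caller can see which step won.
    """
    for parser in parsers:
        try:
            text = parser(page)
        except ParserError:
            continue  # this parser failed, try the next one
        if text.strip():
            return parser.__name__, text
    return "ocr", ocr(page)


# Hypothetical stand-ins for the real parsers and the OCR backend.
def poppler_parser(page):
    if "embedded_text" not in page:
        raise ParserError("no text layer")
    return page["embedded_text"]


def pdfminer_parser(page):
    raise ParserError("not available in this sketch")


def fake_ocr(page):
    return "text recognised from image"


parsers = [poppler_parser, pdfminer_parser]
result_searchable = extract_page_text({"embedded_text": "cream recipe"}, parsers, fake_ocr)
result_image_only = extract_page_text({}, parsers, fake_ocr)
```

With this shape, a PDF that already carries a text layer never reaches the OCR step, which matches the description that the OCR step is only attempted after parsing.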

If the text is not being parsed, check the logs and make sure the package `poppler-utils` is installed. If a stable, Python-only PDF text parser is found, these binary dependencies can be removed.
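One way to check this outside of Mayan is to call `pdftotext` (shipped with `poppler-utils`) directly and see whether a PDF yields any text. The sketch below is only a diagnostic aid under that assumption; the helper names are made up for illustration, and it says nothing about how Mayan itself invokes Poppler.

```python
import shutil
import subprocess


def pdftotext_path():
    """Return the path of the pdftotext binary, or None if it is not on PATH."""
    return shutil.which("pdftotext")


def embedded_text(pdf_path, timeout=60):
    """Return the text layer of a PDF via pdftotext, or None if the tool is missing."""
    binary = pdftotext_path()
    if binary is None:
        return None
    # Passing "-" as the output file makes pdftotext write to stdout.
    completed = subprocess.run(
        [binary, pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        timeout=timeout,
        check=True,
    )
    return completed.stdout


def has_text_layer(text):
    """True if the extracted text holds anything besides whitespace or form feeds."""
    return bool(text and text.strip())
```

Usage would look like `has_text_layer(embedded_text("scan.pdf"))`, where `scan.pdf` is a placeholder path. A `False` result for a PDF you believe to be searchable points at the file or the tool rather than at Mayan.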

On the topic of activity: 

The project is released free of charge with almost all rights provided to change and reuse the code. Expecting fast, on-point, free support in addition to that is unrealistic.

Low participation for technical queries in forums and mailing lists is a common situation with open projects. Any suggestions or ideas to help improve on that are welcome.

Bear in mind that many (if not most) subscribers to this list are not developers but users like yourself. Expecting professional advice from other users is unrealistic.

Core contributors, a few developers, devops personnel and I visit the list from time to time, but this is not the only task we do in the project. There is also backend code, API code, frontend code, deployments (Docker, Salt, Fabric, etc.), code testing, compatibility testing (database, Python versions, OS, cloud environments), documentation, translations, design decisions, consulting, ticket triage, support, customization, the website, social media sites, events (DjangoCon, PyCon), etc. Any help in those other areas will translate into more time for us to answer questions on the list.

There are also non-code decisions that take a lot of time to research. For example, Google Groups is showing its age and there is a discussion about whether to ditch it and move to a proper (probably paid from our pockets) forum solution. Another matter is funding and making the project self-sustaining. To this end, Mayan EDMS, LLC was created in the USA, in the hope that in the near future we could have paid developers working full time on the code and providing support, instead of just part-time volunteers. This means a new set of tasks, documents, and legal procedures that need to be taken care of.

Mayan EDMS was started 6 years ago and is used by the State of California, the Government of Puerto Rico, The University of Montreal, Intel, with CEMEX and Deloitte recently joining, just to name a few known names (http://www.mayan-edms.com/cases/). It is very much alive and picking up steam :)  For users or organizations needing timely response from core contributors, be it consulting or support, paid plans are available (http://www.mayan-edms.com/providers/). Customization and rebranding are also available if needed.

There are many areas that are not code related where a little help goes a long way. Even stuff like spell checking or just taking the time to add additional information on a ticket or bug report helps a lot!

I appreciate your concerns and opinions about the project and hope that we continue sharing and discussing them.


Re: [Mayan EDMS: 1629] Re: Search for document content not OCRed within Mayan

Jesaja Everling
Just to add a quick note: I'm sure there are many people who, like me, read the mailing list but don't chime in if they don't have a useful answer to offer for a question.


[Mayan EDMS: 1643] Re: Search for document content not OCRed within Mayan

andre
In reply to this post by andre
Hi Roberto,

first, please don't take my comment about the response time as criticism - that wasn't my intention. I use a few open source projects and have the greatest respect and sympathy for their creators and contributors for sharing them. Also, I wouldn't "demand" an answer, as I am aware that I couldn't build something like this myself. I was wondering mainly because I expected an easy answer like "yes, that's possible" or "no, won't work" from you or some users. Not seeing an answer raised some concerns about the vitality of the project. So: thank you for what you have created. One of the first positive things I noticed is that I would not have to waste RAM and energy on the overhead of a dumb JVM implementation.

Btw also thank you for the code examples you linked to, but I don't really understand what's happening there as I am not a developer. 

That being said, I tried the following:

Please check the attached cream recipe PDF :) it has been scanned and then OCRed using OCRKit on macOS. It's searchable. poppler-utils are installed and up-to-date. Now when I add it, the following logs are generated:

documents.models <1318> [INFO] "new_version() Creating new document version for document: scan11429-OCR.pdf"

[2017-04-22 06:09:04,319: INFO/Worker-4] Creating new document version for document: scan11429-OCR.pdf

documents.models <1318> [INFO] "save() Creating new version for document: scan11429-OCR.pdf"

[2017-04-22 06:09:04,321: INFO/Worker-4] Creating new version for document: scan11429-OCR.pdf

documents.models <1318> [INFO] "save() New document version "scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00" created for document: scan11429-OCR.pdf"

[2017-04-22 06:09:04,569: INFO/Worker-4] New document version "scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00" created for document: scan11429-OCR.pdf

documents.models <1318> [INFO] "new_version() New document version queued for document: scan11429-OCR.pdf"

[2017-04-22 06:09:04,602: INFO/Worker-4] New document version queued for document: scan11429-OCR.pdf

ocr.tasks <1316> [INFO] "task_do_ocr() Starting document OCR for document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00"

[2017-04-22 06:09:07,334: INFO/Worker-2] Starting document OCR for document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00

ocr.parsers <1316> [INFO] "process_document_page() Processing page: 1 of document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00"

[2017-04-22 06:09:07,344: INFO/Worker-2] Processing page: 1 of document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00

ocr.parsers <1316> [INFO] "process_document_page() Finished processing page: 1 of document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00"

[2017-04-22 06:09:07,411: INFO/Worker-2] Finished processing page: 1 of document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00

ocr.tasks <1316> [INFO] "task_do_ocr() OCR complete for document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00"

[2017-04-22 06:09:07,412: INFO/Worker-2] OCR complete for document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00

documents.managers <1315> [INFO] "check_delete_periods() Executing"


According to the logs, OCR is done again. And I can find the document by the terms in it.

If I disable the automatic OCR queue for this document type, the document is NOT found.

Please let me know which further information I can provide.

Thanks in advance
André



Attachment: scan11429-OCR.pdf (736K)

[Mayan EDMS: 1647] Re: Search for document content not OCRed within Mayan

Manuel Reiter
In reply to this post by rosarior
Thanks for your post, Roberto! Like andre, I really appreciate that you offer Mayan as an open source solution and all the effort you put into this. Thank you very much indeed for that! In my experience, however, dwindling activity around open source projects *can* be a sign that a project is not doing too well - if that's not the case with Mayan and, as I wrote earlier, we just came at maybe not the best of moments, all the better for it!

I'm still in the process of deciding whether Mayan fits (or can be made to fit) my needs and thus whether I'll stay with it. If I do, I'll certainly be happy to put some of my time into helping the project out.

[Mayan EDMS: 1657] Re: Search for document content not OCRed within Mayan

andre
In reply to this post by andre
So, I tried to get to the bottom of it. @RobertoRosario would you please clarify at least my first question:

Do you also call the extraction of existing text "OCR processing"? Yes / No?

This is what I suspect after spending some time trying to understand what happens in the code (I am not a dev), making sure python-pdfminer is installed and watching the logs for NoMIMETypeMatch and ParserError. Now I just kept the "OCR processing" enabled, watched the task manager and threw a few dozen files into the upload queue -> No visible tesseract process, and everything finished much faster than real OCR processing would have.
If my conclusions are right, everything works as it should, and has all along. But if you, dear reader, understand OCR as "Optical Character Recognition" (like I do) and not as "parse existing text from documents and, if that fails, do a real Optical Character Recognition", as I believe happens here, you are very likely to waste the same amount of time when you try to plan things from the beginning.

Yes, that was also a little rant, but hopefully this clarification (if correct) can be seen as a contribution, too.

Now let's find out how to update to 2.2 when it's available, and if I find it documented somewhere I might have some time left afterwards to translate some phrases into German. Motivation is there ;)
