[Mayan EDMS: 1599] Search for document content not OCRed within Mayan

[Mayan EDMS: 1599] Search for document content not OCRed within Mayan

andre
Hi,

I just discovered Mayan, installed the most recent version (on bare metal, not docker) and really like it. I have been using Alfresco for my personal and home office documents so far, and it seems Mayan could replace it easily for my needs.

But there's one thing I'm not sure I got right: Will I only find content of documents which have been OCRed in Mayan? Over the years I have scanned a few thousand docs, some of them with manually adjusted OCR (complex tables and the like). All of my PDFs have already been OCRed, and I guess it would take Mayan / tesseract weeks to redo this work - I wouldn't want that.

How can I activate search for this existing content? Also, what about (for example) Word or PowerPoint documents, is there a way to search within them?

thank you!

--

---
You received this message because you are subscribed to the Google Groups "Mayan EDMS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

[Mayan EDMS: 1600] Re: Search for document content not OCRed within Mayan

andre
I am a little surprised not to see an answer here after a few days. Don't get me wrong, I am really keen on using Mayan in the future.

This is what I am doing:
1. Import a searchable PDF without running OCR during the import, because it already contains the text.
2. Use the search function to look for content that I know is in the PDF.
3. Nothing is found.

Is this by design, meaning Mayan can only search text that was recognized within Mayan, and existing text isn't added to the search index?
Or am I doing something wrong?

Thanks a lot!



[Mayan EDMS: 1602] Re: Search for document content not OCRed within Mayan

Manuel Reiter
Hi andre,

I'll admit I've also been a bit underwhelmed by the activity in this group. Don't get me wrong anybody, I know everybody here is a volunteer and the answers I *did* receive were friendly and helpful - it's just that I've seen communities around open source software that were a bit more lively. Makes one wonder a bit how alive Mayan actually is as an open source project. Maybe we just picked a bad time to discover it, Easter might have kept a couple of people busy.

As for your question, I'm afraid I can't help you - I'd be interested in the answer myself though, so I hope someone still gives an answer.

On Monday, April 17, 2017 at 5:23:23 PM UTC+2, andre wrote:
I am a little bit surprised of not seeing an answer here after a few days - don't get me wrong. I am really keen using Mayan in the future.


[Mayan EDMS: 1603] Re: Search for document content not OCRed within Mayan

MacRobb Simpson
In reply to this post by andre
Here's something that /may/ help:

In Mayan, the OCR text is stored in the `ocr_documentpagecontent` table.
It's stored per page (unfortunate, but if you don't care, you might be able to just shove all your OCR'd text into page 1 of each document).

Here's a SQL query to start with:
SELECT d.label, p.page_number, p.id
FROM `documents_document` AS d
INNER JOIN `documents_documentversion` AS v ON d.id = v.document_id
INNER JOIN `documents_documentpage` AS p ON p.document_version_id = v.id
LIMIT 100;

This will get you a list of document labels (you might want the ID or other fields), page numbers, and unique page IDs. The unique page IDs are what you need to create rows in the `ocr_documentpagecontent` table.

It may not be a perfect solution, but you can definitely rig up some stuff to get what you need, supported or not!
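
A rough sketch of that idea in Python, run against a throwaway in-memory SQLite database rather than a real Mayan install. The three `documents_*` tables and the join come from the query above; the `document_page_id` and `content` columns on `ocr_documentpagecontent`, the helper name `store_page_text`, and the sample data are assumptions for illustration only, so check the actual schema (and back up the database) before trying anything like this.

```python
import sqlite3

# Toy schema: only the columns the query above relies on. The columns of
# ocr_documentpagecontent (document_page_id, content) are an assumption.
SCHEMA = """
CREATE TABLE documents_document (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE documents_documentversion (
    id INTEGER PRIMARY KEY, document_id INTEGER);
CREATE TABLE documents_documentpage (
    id INTEGER PRIMARY KEY, document_version_id INTEGER, page_number INTEGER);
CREATE TABLE ocr_documentpagecontent (
    id INTEGER PRIMARY KEY, document_page_id INTEGER UNIQUE, content TEXT);
"""

PAGE_QUERY = """
SELECT d.label, p.page_number, p.id
FROM documents_document AS d
INNER JOIN documents_documentversion AS v ON d.id = v.document_id
INNER JOIN documents_documentpage AS p ON p.document_version_id = v.id
ORDER BY d.label, p.page_number
"""


def list_pages(conn):
    """Return (label, page_number, page_id) rows, as in the query above."""
    return conn.execute(PAGE_QUERY).fetchall()


def store_page_text(conn, page_id, text):
    """Hypothetical helper: insert or overwrite the text row for one page."""
    conn.execute(
        "INSERT OR REPLACE INTO ocr_documentpagecontent "
        "(document_page_id, content) VALUES (?, ?)",
        (page_id, text),
    )


# Sample data: one document, one version, two pages.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO documents_document VALUES (1, 'scan.pdf')")
conn.execute("INSERT INTO documents_documentversion VALUES (1, 1)")
conn.executemany(
    "INSERT INTO documents_documentpage VALUES (?, 1, ?)", [(10, 1), (11, 2)]
)

pages = list_pages(conn)
# Put all externally produced text on the first page of the document.
store_page_text(conn, pages[0][2], "text from my own OCR tool")
```

Whether the search picks up rows written this way depends on how the search backend is wired to that table, which this sketch does not cover.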


[Mayan EDMS: 1625] Re: Search for document content not OCRed within Mayan

rosarior
Administrator
The OCR app will always try to parse the text of previously OCRed PDFs, office documents and text files before attempting the OCR step (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/classes.py#L32).

Several parsers can be registered and will be tried in sequence. A Poppler and a PDFMiner parser are included by default (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/parsers.py#L201). The PDFMiner parser could be removed if a viable, drop-in replacement that supports Python 3.x is not found by the next release.

If the text is not being parsed, check the logs and make sure the package `poppler-utils` is installed. If a stable Python only PDF text parser is found these binary dependencies can be removed.
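
As a rough illustration of "tried in sequence" (this is not the code behind the links above): each parser gets a chance to return text, and OCR only runs if none of them produces any. The function and parser names here are made up for the example; `pdftotext` is the command-line tool shipped in `poppler-utils`, and `shutil.which` only checks whether it is on the PATH.

```python
import shutil


def poppler_available():
    """True if the pdftotext binary from poppler-utils is on the PATH."""
    return shutil.which("pdftotext") is not None


def extract_text(path, parsers, ocr):
    """Try each parser in order; fall back to OCR if none yields text.

    `parsers` is a list of callables taking a path and returning a string
    (or raising on failure). `ocr` has the same signature. All names are
    hypothetical and only illustrate the control flow.
    """
    for parser in parsers:
        try:
            text = parser(path)
        except Exception:
            continue  # this parser failed, try the next one
        if text and text.strip():
            return text, parser.__name__
    return ocr(path), ocr.__name__


# Stand-ins for real parsers, so the example runs without any PDF tools.
def broken_parser(path):
    raise RuntimeError("cannot read file")


def empty_parser(path):
    return ""  # e.g. an image-only PDF with no text layer


def text_layer_parser(path):
    return "embedded text"


def fake_ocr(path):
    return "recognized text"


found = extract_text("a.pdf", [broken_parser, text_layer_parser], fake_ocr)
fallback = extract_text("b.pdf", [broken_parser, empty_parser], fake_ocr)
```

In this toy run, `found` comes from the second parser and `fallback` comes from the OCR stand-in, because neither parser produced any text for that file.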

On the topic of activity: 

The project is released free of charge with almost all rights provided to change and reuse the code. Expecting fast, on-point, free support in addition to that is unrealistic.

Low participation for technical queries in forums and mailing lists is a common situation with open projects. Any suggestions or ideas to help improve on that are welcome.

Bear in mind that many (if not most) subscribers to this list are not developers but users like yourself. Expecting professional advice from other users is unrealistic.

Core contributors, a few developers, devops personnel, and I visit the list from time to time, but this is not the only task we do in the project. There is also backend code, API code, frontend code, deployments (Docker, Salt, Fabric, etc.), code testing, compatibility testing (database, Python versions, OS, cloud environments), documentation, translations, design decisions, consulting, ticket triage, support, customization, the website, social media sites, events (DjangoCon, PyCon), etc. Any help in those other areas will translate into more time for us to answer questions on the list. There are other non-code decisions that take a lot of time to research, e.g.: Google Groups is showing its age and there is a discussion on whether or not to ditch it and move to a proper (probably paid from our pockets) forum solution. Another matter is funding and making the project self-sustaining. To this end, Mayan EDMS, LLC, was created in the USA, with the hope that in the near future we could have paid developers working full time on the code and providing support, instead of just part-time volunteers. This means a new set of tasks, documents, and legal procedures that need to be taken care of.

Mayan EDMS was started 6 years ago and is used by the State of California, the Government of Puerto Rico, The University of Montreal, Intel, with CEMEX and Deloitte recently joining, just to name a few known names (http://www.mayan-edms.com/cases/). It is very much alive and picking up steam :)  For users or organizations needing timely response from core contributors, be it consulting or support, paid plans are available (http://www.mayan-edms.com/providers/). Customization and rebranding are also available if needed.

There are many areas that are not code related where a little help goes a long way. Even stuff like spell checking or just taking the time to add additional information on a ticket or bug report helps a lot!

I appreciate your concerns and opinions about the project and hope that we continue sharing and discussing them.

On Tuesday, April 18, 2017 at 9:45:51 AM UTC-4, MacRobb Simpson wrote:
Here's something that /may/ help:

[...]


Re: [Mayan EDMS: 1629] Re: Search for document content not OCRed within Mayan

Jesaja Everling
Just to add a quick note: I'm sure there are many people who, like me, read the mailing list but don't chime in when they don't have a useful answer to offer.

On Fri, Apr 21, 2017 at 1:42 AM, Roberto Rosario <[hidden email]> wrote:
The OCR app will always try to parse the text of previously OCRed PDFs, office documents and text files before attempting the OCR step (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/classes.py#L32).

[...]

[Mayan EDMS: 1643] Re: Search for document content not OCRed within Mayan

andre
In reply to this post by andre
Hi Roberto,

First, please don't take my comment about the response time as criticism - that wasn't my intention. I use a few open source projects and have the greatest respect and sympathy for their creators and contributors for sharing them. Also, I wouldn't "demand" an answer, as I am aware that I couldn't build something like this myself. I was mostly wondering because I expected an easy answer like "yes, that's possible" or "no, won't work" from you or some users. Not seeing an answer raised some concerns about the vitality of the project. So: thank you for what you have created. One of the first positive things I noticed is that I would no longer have to waste RAM and energy on the overhead of a dumb JVM implementation.

By the way, thank you also for the code you linked to, but I don't really understand what's happening there as I am not a developer.

That being said, I tried the following:

Please check the attached cream recipe PDF :) It was scanned and then OCRed using OCRKit on macOS, and it is searchable. poppler-utils is installed and up to date. When I add the document, the following logs are generated:

documents.models <1318> [INFO] "new_version() Creating new document version for document: scan11429-OCR.pdf"

[2017-04-22 06:09:04,319: INFO/Worker-4] Creating new document version for document: scan11429-OCR.pdf

documents.models <1318> [INFO] "save() Creating new version for document: scan11429-OCR.pdf"

[2017-04-22 06:09:04,321: INFO/Worker-4] Creating new version for document: scan11429-OCR.pdf

documents.models <1318> [INFO] "save() New document version "scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00" created for document: scan11429-OCR.pdf"

[2017-04-22 06:09:04,569: INFO/Worker-4] New document version "scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00" created for document: scan11429-OCR.pdf

documents.models <1318> [INFO] "new_version() New document version queued for document: scan11429-OCR.pdf"

[2017-04-22 06:09:04,602: INFO/Worker-4] New document version queued for document: scan11429-OCR.pdf

ocr.tasks <1316> [INFO] "task_do_ocr() Starting document OCR for document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00"

[2017-04-22 06:09:07,334: INFO/Worker-2] Starting document OCR for document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00

ocr.parsers <1316> [INFO] "process_document_page() Processing page: 1 of document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00"

[2017-04-22 06:09:07,344: INFO/Worker-2] Processing page: 1 of document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00

ocr.parsers <1316> [INFO] "process_document_page() Finished processing page: 1 of document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00"

[2017-04-22 06:09:07,411: INFO/Worker-2] Finished processing page: 1 of document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00

ocr.tasks <1316> [INFO] "task_do_ocr() OCR complete for document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00"

[2017-04-22 06:09:07,412: INFO/Worker-2] OCR complete for document version: scan11429-OCR.pdf - 2017-04-22 06:09:04.326366+00:00

documents.managers <1315> [INFO] "check_delete_periods() Executing"


According to the logs, OCR is done again. And I can find the document by the terms in it.

If I disable the automatic OCR queue for this document type, the document is NOT found.

Please let me know what further information I can provide.

Thanks in advance
André



On Friday, April 14, 2017 at 6:37:31 PM UTC+2, andre wrote:
Hi,

[...]

Attachment: scan11429-OCR.pdf (736K)

[Mayan EDMS: 1647] Re: Search for document content not OCRed within Mayan

Manuel Reiter
In reply to this post by rosarior
Thanks for your post, Roberto! Like andre, I really appreciate that you offer Mayan as an open source solution and all the effort you put into this. Thank you very much indeed for that! In my experience, however, dwindling activity around open source projects *can* be a sign that a project is not doing too well. If that's not the case with Mayan and, as I wrote earlier, we just came at maybe not the best of moments, all the better!

I'm still in the process of deciding whether Mayan fits (or can be made to fit) my needs and thus whether I'll stay with it. If I do, I'll certainly be happy to put some of my time into helping the project out.


[Mayan EDMS: 1657] Re: Search for document content not OCRed within Mayan

andre
In reply to this post by andre
So, I tried to get closer to the bottom of it. @RobertoRosario, would you please clarify at least my first question:

Do you also call the extraction of existing text "OCR processing"? Yes / No?

This is what I suspect after spending some time trying to understand what happens in the code (I am not a dev), making sure python-pdfminer is installed, and watching the logs for NoMIMETypeMatch and ParserError. I then left "OCR processing" enabled, watched the task manager, and threw a few dozen files into the upload queue: there was no visible tesseract process, and everything finished much faster than real OCR processing would have.
If my conclusions are right, everything works as it should and has done so all along. But if you, dear reader, understand OCR as "Optical Character Recognition" (like I do) and not as "parse existing text from documents and, if that fails, do a real Optical Character Recognition", as I believe happens here, you are very likely to waste the same amount of time when you try to plan things from the beginning.
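
One way to check up front whether a given PDF already carries extractable text, sketched in Python: ask `pdftotext` (from `poppler-utils`) for the text of the first page and see whether anything comes back. The function names are my own, and the `run` parameter exists only so the logic can be tried without poppler installed. An empty result does not prove that a file needs OCR; it only says no text was extracted from that page.

```python
import subprocess


def first_page_text(pdf_path, run=subprocess.run):
    """Return the text pdftotext extracts from page 1, or '' on failure.

    Raises FileNotFoundError if the pdftotext binary is not installed.
    """
    result = run(
        # -f/-l limit extraction to page 1; "-" sends the text to stdout.
        ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return ""
    return result.stdout


def has_text_layer(pdf_path, run=subprocess.run):
    """True if at least some non-whitespace text was extracted."""
    return bool(first_page_text(pdf_path, run=run).strip())
```

On a machine with poppler-utils installed, `has_text_layer("scan11429-OCR.pdf")` would be the call to try on a file like the one attached earlier in this thread.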

Yes, that was also a little rant, but hopefully this clarification (if correct) can be seen as a contribution, too.

Now let's find out how to update to 2.2 when it's available, and if I find it documented somewhere, I might have some time left afterwards to translate some phrases into German. Motivation is there ;)


[Mayan EDMS: 1679] Re: Search for document content not OCRed within Mayan

Dave S
In reply to this post by andre
Hi Andre,

I am new here, though I have years of experience supporting and developing a commercial Enterprise Content Management system. I am installing Mayan now, so I apologize if I am missing something that will become obvious upon use, but I am curious about the need for OCR for the majority of documents. Would including (manually entered) metadata appropriate to the document type allow you to find the information you are searching for?

OCR is a wonderful thing and something that I enjoy working with, though there can be challenges in getting the OCR'ed data (accurately) and then being able to use that information in a meaningful manner. Generally, unless I need to have that information and can consistently assign (some of the discrete) data to the metadata - and I can afford the processing time/expense - manual indexing or reading barcodes (a whole other discussion! :-) ) meets 90+% of my needs.

Perhaps once I start playing, my question will answer itself, and I certainly don't mean any offense, but I am interested in how people are using the OCR'ed information (and, relatedly, whether it has been found to be accurate the vast majority - 95+% - of the time).

Thanks!

dave


[Mayan EDMS: 1682] Re: Search for document content not OCRed within Mayan

andre
Hi Dave,

I wouldn't say that my requirements for OCR are totally aligned with those of business users. For me, the DMS is mainly a (chaotic) store for every piece of digital information I collect in my personal and business life. I am getting rid of paper as much as possible, but I am not looking to use workflows, for example to manage the bills I receive. I use the full text index to find everything, and that usually gives quick results knowing only a few search terms, like account numbers, keywords ("bill", "food", "account statement", ...), names and so on. And it doesn't matter at all if I spend two minutes instead of a few seconds searching because I have to try a few different things.

So for me, there are three, maybe four types of "information containers" relevant:

- digital content I have created, like office docs, emails and stuff (not photos, they are managed separately) - no OCR necessary
- PDFs I receive - no OCR necessary
- PDFs from scanned paper - always OCRed. I don't aim for 100% accuracy (though I would say that the results are very close), but sometimes there are complex documents which get some "manual attention", for example because they are bilingual.

You see, everything is about the full text index content, so I do not care much about other metadata. But if I have invested some time in better OCR results, then of course I wouldn't want to see it wasted by having it overwritten - if your question is aimed at my initial request here. And of course in this case it's relevant to know how this information is treated by the DMS.




On Thursday, May 4, 2017 at 12:14:31 AM UTC+2, Dave S wrote:

Hi Andre,

[...]

[Mayan EDMS: 1736] Re: Search for document content not OCRed within Mayan

rosarior
Administrator
I have been thinking about splitting the extraction of content into two separate areas: one for recognized text data (OCR, barcodes in the future) and one for embedded or parsed text (text files, PDFs with text). The idea is to give users a better expectation of the quality of the text depending on the area or tab they access. OCR is expected to have errors; parsed text is expected to have few or no errors.

It would be a small split in the OCR app and some minor UI changes. Does that make sense?
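
To make the proposal a bit more concrete, here is one way the split could look as plain data, sketched in Python. Nothing here is existing Mayan code; the class and field names are invented, and the only point is that each page keeps two separate text fields with different expected quality.

```python
from dataclasses import dataclass


@dataclass
class PageText:
    """Hypothetical per-page record keeping the two kinds of text apart."""

    page_number: int
    parsed: str = ""      # embedded text read from the file, few or no errors
    recognized: str = ""  # OCR output, errors expected

    def best(self):
        """Prefer parsed text; use recognized text only if none was parsed."""
        return self.parsed if self.parsed.strip() else self.recognized


def search(pages, term):
    """Return page numbers whose parsed or recognized text contains term."""
    needle = term.lower()
    return [
        p.page_number
        for p in pages
        if needle in p.parsed.lower() or needle in p.recognized.lower()
    ]


pages = [
    PageText(1, parsed="Cream recipe"),
    PageText(2, recognized="Crearn recipe"),  # typical OCR misread
]
hits = search(pages, "cream")
```

Keeping the two fields separate means a UI could show them on different tabs, and a later OCR run would never have to overwrite text that was parsed from the file.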

On Sunday, May 7, 2017 at 3:48:06 PM UTC-4, andre wrote:
Hi Dave,

[...]

[Mayan EDMS: 1777] Re: Search for document content not OCRed within Mayan

David Kornahrens
Makes sense to me. We have a few documents that weren't scanned by OCR, but I do not see anything in the logs to indicate that.
