[Mayan EDMS: 1466] Automatically search on a document

odootester2016
Hello,
I'm looking for a program that can read a document and extract information from it.
For example, I receive a bill from Apple. The program would recognize it because I would have defined that a certain region of the page contains "Apple" and its address, and I would also have defined where the fields are located on an Apple bill. I would then like to extract, for example, the bill number (which should always be in the same place) and the total price of the bill (whose position varies depending on the number of items I ordered).

Unfortunately, I couldn't find the technical term for this on the web. What is this called? Is it possible with Mayan EDMS?

Thank you in advance for replying, and have a good day,

Cheers,

Sam

--

---
You received this message because you are subscribed to the Google Groups "Mayan EDMS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

[Mayan EDMS: 1470] Re: Automatically search on a document

rosarior
Administrator
You are describing two different features. One is called OCR zones, the other is layout analysis (example: https://github.com/tabulapdf/tabula). I've seen forks of Mayan with this feature added as a commercial plugin, but none have donated the code to the core version I develop. These features are very complex and costly, and they sit in a patent minefield (http://patents.justia.com/patents-by-us-classification/382/321). Without external sponsorship I'm not able to implement them.
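
To make the two ideas concrete, here is a minimal sketch of what zone-based extraction could look like once OCR has produced words with bounding boxes. It is not Mayan code: the `Word` class, both helper functions, the zone coordinates, and the sample data are all hypothetical. The first helper reads a fixed zone (the bill number); the second is only a simple anchor heuristic for a value whose position moves (the total), not real layout analysis.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Word:
    text: str
    box: Box


def words_in_zone(words: List[Word], zone: Box) -> str:
    """Join the words whose centre point lies inside the zone, in reading order."""
    zl, zt, zr, zb = zone
    hits = []
    for w in words:
        left, top, right, bottom = w.box
        cx, cy = (left + right) / 2, (top + bottom) / 2
        if zl <= cx <= zr and zt <= cy <= zb:
            hits.append(w)
    hits.sort(key=lambda w: (w.box[1], w.box[0]))
    return " ".join(w.text for w in hits)


def value_right_of(words: List[Word], label: str) -> Optional[str]:
    """Return the nearest word to the right of `label` on the same line, if any."""
    for anchor in words:
        if anchor.text != label:
            continue
        _, top, right, bottom = anchor.box
        same_line = [
            w for w in words
            if w.box[0] >= right and top <= (w.box[1] + w.box[3]) / 2 <= bottom
        ]
        if same_line:
            return min(same_line, key=lambda w: w.box[0]).text
    return None


# Hypothetical OCR output for one page of a bill.
words = [
    Word("Apple", (50, 40, 130, 60)),
    Word("Invoice", (400, 40, 480, 60)),
    Word("No.", (490, 40, 520, 60)),
    Word("A-10023", (530, 40, 620, 60)),
    Word("Total", (400, 700, 450, 720)),
    Word("129.00", (530, 700, 600, 720)),
]

# Hypothetical zone, defined once per sender.
INVOICE_NUMBER_ZONE = (525, 30, 640, 70)

invoice_number = words_in_zone(words, INVOICE_NUMBER_ZONE)
total = value_right_of(words, "Total")
```

The anchor lookup works wherever the "Total" line ends up on the page, which is why it suits the field whose position depends on the number of items.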

In reply to the post by odootester2016 of Friday, December 16, 2016 at 5:36:25 AM UTC-4.

[Mayan EDMS: 1472] Re: Automatically search on a document

Matthias Löblich
In reply to this post by odootester2016
Hi,
I am also looking for similar features. At the moment I am using a workaround that is not bulletproof:
a Mayan app I wrote that uses regular expressions to identify specific documents:

https://gitlab.com/mayan-edms/document_analyzer
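
As an illustration of the regex approach in general (this is not the code of the app linked above), here is a minimal sketch that identifies a document type from its OCR text and then pulls out fields. The rule table, field names, patterns, and sample text are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical rules: one pattern identifies the document type,
# further patterns extract individual fields from the OCR text.
RULES = {
    "apple_invoice": {
        "match": re.compile(r"\bApple\b.*\bInvoice\b", re.IGNORECASE | re.DOTALL),
        "fields": {
            "invoice_number": re.compile(
                r"Invoice\s+No\.?\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
            ),
            "total": re.compile(
                r"^Total\s*:?\s*([0-9]+[.,][0-9]{2})", re.IGNORECASE | re.MULTILINE
            ),
        },
    },
}


def analyze(text: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the first matching document type and its fields, or None."""
    for doc_type, rule in RULES.items():
        if rule["match"].search(text):
            result: Dict[str, Optional[str]] = {"type": doc_type}
            for name, pattern in rule["fields"].items():
                found = pattern.search(text)
                result[name] = found.group(1) if found else None
            return result
    return None


# Hypothetical OCR text of a bill.
sample = """Apple Distribution International
Invoice No.: A-10023
1 x USB-C cable    29.00
1 x Keyboard      100.00
Total: 129.00
"""

result = analyze(sample)
```

Because the patterns only see plain text, they cannot tell where on the page a match sits, which is the weakness that makes this kind of workaround fragile.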

@Roberto:
One thing could improve the identification: storing the hOCR data provided by Tesseract, not only the plain text.
hOCR also includes layout information, so it could be possible to combine the regex search with a layout "query" based on the hOCR data.
What do you think about extending the OCR app model DocumentPageContent with a flag indicating whether the content is plain text or hOCR?
If the content is hOCR, an hOCR parser should extract the plain text, so that the new format does not affect the other parts of Mayan.
I would be happy to support the development needed to extend the OCR app in this direction.
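
A rough sketch of such a combined regex and layout "query", using only the Python standard library. It assumes the usual hOCR convention of `ocrx_word` elements carrying a `bbox` in their `title` attribute; the `HocrWords` class, the `query` function, the region, and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


class HocrWords(HTMLParser):
    """Collect (text, bbox) for every element with class 'ocrx_word'."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, Box]] = []
        self._box: Optional[Box] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or "").split():
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title") or "")
            if m:
                self._box = tuple(int(g) for g in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._box is not None:
            self.words.append(("".join(self._text).strip(), self._box))
            self._box = None


def query(words: List[Tuple[str, Box]], pattern: str, region: Box) -> List[str]:
    """Words that fully match `pattern` and lie entirely inside `region`."""
    rl, rt, rr, rb = region
    rx = re.compile(pattern)
    return [
        text for text, (left, top, right, bottom) in words
        if rx.fullmatch(text)
        and left >= rl and top >= rt and right <= rr and bottom <= rb
    ]


# Hypothetical hOCR fragment.
HOCR = """
<div class='ocr_page' title='bbox 0 0 700 900'>
 <span class='ocr_line' title='bbox 50 40 620 60'>
  <span class='ocrx_word' title='bbox 50 40 130 60; x_wconf 96'>Apple</span>
  <span class='ocrx_word' title='bbox 530 40 620 60; x_wconf 93'>A-10023</span>
 </span>
 <span class='ocr_line' title='bbox 50 300 620 320'>
  <span class='ocrx_word' title='bbox 50 300 130 320; x_wconf 90'>B-20000</span>
 </span>
</div>
"""

parser = HocrWords()
parser.feed(HOCR)
parser.close()

# Only look for number-like tokens in the top right corner of the page.
matches = query(parser.words, r"[A-Z]-\d+", (500, 0, 700, 100))
```

Both `A-10023` and `B-20000` match the regex, but only the first lies in the queried region, which is the kind of disambiguation plain text cannot provide.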

Best regards,
Matthias
PS: Storing the hOCR data could make features like this possible: https://github.com/shsdev/hocr-parser-hadoopjob


[Mayan EDMS: 1479] Re: Automatically search on a document

rosarior
Administrator
Having a flag to differentiate between OCR text and hOCR is a good idea. Now that the default OCR has been updated
to use PyOCR (which exposes hOCR) this could be possible in the future.

https://gitlab.com/mayan-edms/mayan-edms/commit/6bfdb053e3abec87aa55c987e5a13a72514ee682
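
For illustration only, the flag idea could be sketched like this. A plain dataclass stands in for the DocumentPageContent model, and the field names, constants, and the crude tag-stripping helper are all hypothetical; a real implementation would use a proper hOCR parser and would presumably want to preserve line breaks.

```python
import re
from dataclasses import dataclass
from html import unescape

FORMAT_PLAIN = "plain"
FORMAT_HOCR = "hocr"


def hocr_to_text(hocr: str) -> str:
    """Strip markup and collapse whitespace; a crude stand-in for an hOCR parser."""
    without_tags = re.sub(r"<[^>]+>", " ", hocr)
    return " ".join(unescape(without_tags).split())


@dataclass
class PageContent:
    """Hypothetical stand-in for a page content record with a format flag."""

    content: str
    content_format: str = FORMAT_PLAIN  # hypothetical flag: plain text or hOCR

    @property
    def plain_text(self) -> str:
        # Callers always read plain_text, so they never see raw hOCR markup.
        if self.content_format == FORMAT_HOCR:
            return hocr_to_text(self.content)
        return self.content


plain_page = PageContent("Invoice No. A-10023")
hocr_page = PageContent(
    "<span class='ocr_line' title='bbox 50 40 620 60'>"
    "<span class='ocrx_word' title='bbox 50 40 130 60'>Invoice</span> "
    "<span class='ocrx_word' title='bbox 140 40 180 60'>No.</span> "
    "<span class='ocrx_word' title='bbox 530 40 620 60'>A-10023</span>"
    "</span>",
    FORMAT_HOCR,
)
```

Search and other consumers would read `plain_text` and get the same string from either record, while layout-aware code could still reach the raw hOCR through `content`.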

In reply to the post by Matthias Löblich of Friday, December 30, 2016 at 1:48:34 PM UTC-4.