[Mayan EDMS: 1909] OCR quality JPG vs. PDF

[Mayan EDMS: 1909] OCR quality JPG vs. PDF

Florian Beverborg
Hi all!

I'm currently evaluating Mayan as a replacement for my current DMS. The documents are all in the JPG format, multiple pages of the same document per folder, scanned at 300dpi. So far adding JPGs does not allow me to create multi-page documents. I used img2pdf to generate multi-page PDFs for import into Mayan, which mostly works fine. BUT: The OCR-quality for the same page is worse when using the PDF files.

I've tried multiple ways to generate the combined PDF and I can see some differences but never managed to get the same recognition quality as using the pure JPG. Since img2pdf (to my knowledge) does not touch the actual JPG data and since I'm using PDF page size fit to image size I don't know what's going wrong here. The PDFs look fine in my PDF viewer and are reported to have correct page sizes. Generating the pages with imagemagick does not improve recognition.

This leads me to the conclusion that the PDFs are rendered internally which degrades the quality.

I have two questions:

1) What can I do to improve PDF recognition quality, either in generating the PDF or in Mayan settings?
2) Is there another way to make multi-page documents from JPGs? Maybe using the REST-API?

Using Mayan version 2.6.2

Cheers,
Flo


--

---
You received this message because you are subscribed to the Google Groups "Mayan EDMS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

[Mayan EDMS: 1914] Re: OCR quality JPG vs. PDF

rosarior
Administrator
Hello,

I recently published a blog post explaining how the converter works: http://www.mayan-edms.org/post/mayan-converter/
In the case of PDF files, the utility pdftoppm is used to convert the pages into images. You can run pdftoppm on the PDF files made by img2pdf to see the actual image Mayan is receiving and spot any degradation.
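
To make that comparison concrete, here is a minimal Python sketch that assembles pdftoppm command lines for two resolutions, so the rendered pages can be inspected side by side. The file names and helper names are made up for illustration; `-jpeg` and `-r` are the pdftoppm options that come up later in this thread. This is not how Mayan itself invokes the tool, only a way to reproduce the rendering step by hand.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi):
    """Return the argument list for rendering every page of a PDF to JPEG."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, output_prefix]


def render(pdf_path, output_prefix, dpi):
    """Run pdftoppm if it is installed; return True when it was executed."""
    if shutil.which("pdftoppm") is None:
        return False
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
    return True


if __name__ == "__main__":
    # "combined.pdf" is a hypothetical file produced by img2pdf.
    for dpi in (150, 300):
        print(" ".join(build_pdftoppm_command("combined.pdf", "page-%d" % dpi, dpi)))
```

Rendering the same PDF at both resolutions and feeding each result to the OCR engine should show whether the resolution alone explains the difference.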

As for your questions:
1) The OCR doesn't preprocess the images before doing the recognition. This is being worked on (there is already a scanline filter to reduce pre-OCR images to 2 colors), but it is not available to the user yet. When available, it will be possible to apply a stack of transformations to the document images before performing the OCR task.
2) Strictly speaking about file types, there is no way to make a multi-page JPEG; the format doesn't support it (JPEG 2000 has the JPM and JPX extensions, which might do, but I don't know how good Pillow's JPEG 2000 support is). Another JPEG format which could be used is MJPG, but it is for video and converting the frames to pages would be a hackish attempt. On the platform side, you can already group images in Mayan using an Index or a SmartLink. All the JPEG uploads need is a unique marker (like a metadata value or a filename fragment). This can be accomplished via the UI and the API. For example, the index template {{ document.label|slice:":4" }} will group all documents with the same first 4 characters in the name. To use a different part of the filename for the grouping, just change the slice argument (http://www.diveintopython3.net/native-datatypes.html#slicinglists).
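
As a rough illustration of what the slice in that index template does, the following Python sketch groups document labels by their first four characters. The labels and the function name are invented for the example; inside Mayan the grouping is done by the index template, not by code like this.

```python
from collections import defaultdict


def group_by_label_prefix(labels, prefix_length=4):
    """Group labels by their leading characters, like slice:":4" does."""
    groups = defaultdict(list)
    for label in labels:
        # label[:4] is the Python equivalent of the template's slice:":4"
        groups[label[:prefix_length]].append(label)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file names: a document number followed by a page marker.
    labels = ["0001_page1.jpg", "0001_page2.jpg", "0002_page1.jpg"]
    for prefix, members in group_by_label_prefix(labels).items():
        print(prefix, members)
```

Changing `prefix_length`, or using a slice such as `label[5:9]`, corresponds to changing the slice argument in the template.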
 

[Mayan EDMS: 1916] Re: OCR quality JPG vs. PDF

Florian Beverborg
Hi Roberto

Using an index would keep metadata per page, right? That would not be ideal, but I'll look into that and also into SmartLinks.

Regarding pdftoppm, the manpage says:
-r number
Specifies the X and Y resolution, in DPI. The default is 150 DPI.

Is it possible that the DPI value saved in the JPGs (explicitly set by me to 300x300 with unit type "DPI") is not carried over to the ppm file or the OCR process? I've seen similar OCR issues with Tesseract when the DPI value is not correct. Is there a way to force the DPI to 300 for all documents (all of them are scanned at 300 DPI), maybe by editing the call to pdftoppm in the code as a quick fix for me? Or maybe this is already implemented as a file metadata flag?
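
The suspected effect can be put into numbers with a small Python sketch. It assumes the PDF page size matches the scanned image exactly at its stated DPI, and the 2480x3508 pixel page is a hypothetical A4 scan; the renderer's exact rounding may differ slightly.

```python
def rendered_size(scan_pixels, scan_dpi, render_dpi):
    """Estimate the pixel size of a page re-rendered at a different DPI."""
    width, height = scan_pixels
    scale = render_dpi / scan_dpi
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    scan = (2480, 3508)  # hypothetical A4 page scanned at 300 DPI
    for dpi in (150, 300):
        print(dpi, rendered_size(scan, 300, dpi))
```

Under these assumptions, rendering at the 150 DPI default halves each dimension, so the OCR engine would see only a quarter of the original pixels.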

Regards,
Flo


[Mayan EDMS: 1925] Re: OCR quality JPG vs. PDF

Florian Beverborg
In reply to this post by rosarior
Hi Roberto

I changed the source to force pdftoppm to use 300 dpi for all files. This not only fixes the initial issue of PDFs having worse recognition quality than JPGs, but even improves some details regarding punctuation, so the quality is now better with the PDF than with the JPG.

I regard this issue as resolved now (for myself), but maybe we can find a less hacky way for all people? What I did was change line 37 of /usr/local/lib/python2.7/dist-packages/mayan/apps/converter/backends/python.py to

pdftoppm = pdftoppm.bake('-jpeg', '-r', '300')
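
For readers unfamiliar with the pattern, the `bake` call above pre-binds arguments to a command object. A standard-library analogue using `functools.partial` is sketched below; the function names and the `dpi` parameter are hypothetical and only illustrate how the resolution could become configurable instead of hard-coded. It is not Mayan's converter code.

```python
from functools import partial


def pdftoppm_args(*args):
    """Build a pdftoppm argument list from the given options and paths."""
    return ["pdftoppm", *args]


def bake_pdftoppm(dpi=300):
    """Return a builder with the JPEG output and resolution options pre-bound."""
    return partial(pdftoppm_args, "-jpeg", "-r", str(dpi))


if __name__ == "__main__":
    # "input.pdf" and "page" are placeholder names.
    print(bake_pdftoppm(300)("input.pdf", "page"))
```

Passing the resolution as a parameter is one way a setting could feed into the call without editing the installed source.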

Is there a way for me to open a bug report, or how do we proceed? I guess I could supply you with a test document as well, if needed.

Cheers,
Flo


[Mayan EDMS: 1929] Re: OCR quality JPG vs. PDF

rosarior
Administrator
Great work Florian! I will find a way to expose this via the settings system. I think it can be included in the next minor version (2.7). Yes, please, you can open an issue here: https://gitlab.com/mayan-edms/mayan-edms/issues

A test document would be an even greater help. Thank you!



[Mayan EDMS: 1939] Re: OCR quality JPG vs. PDF

Florian Beverborg
I've created issue #416.

Regarding the quick fix you mentioned: Maybe it makes more sense to expose this as a per-document-type setting? But that would require much more development and testing, so yeah I can see why that would be nice to have for now. I've gone into more details in the issue, let's take the discussion there ;)

Cheers,
Flo


[Mayan EDMS: 1939] Re: OCR quality JPG vs. PDF

rosarior
Administrator
Thanks!
