[Mayan EDMS: 2460] Error with OCR in Spanish - Mayan 2.7

Pablo Castro
Hello,

I installed Mayan with the following guide: https://www.mayan-edms.com/post/deploy-mayan-docker-mysql/

This means I have two Docker containers, Mayan EDMS and MySQL, running on an Ubuntu box.

I tried the OCR function but was getting the following error in the OCR errors log:

(1366, "Incorrect string value: '\\xEF\\xAC\\x81\\x0A21...' for column 'content' at row 1")

I tried a different document and got a similar error:

(1366, "Incorrect string value: '\\xEF\\xAC\\x81eio...' for column 'content' at row 1")
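
For reference, the bytes quoted in both errors can be decoded with a short Python sketch. It only shows which character the OCR output contained and that this character has no latin-1 representation. That the `content` column uses a non-UTF-8 character set such as latin1 is an assumption here, not something the error message confirms; MySQL's `utf8` character set stores characters of up to three bytes, so a three-byte character being rejected at least points in that direction.

```python
import unicodedata

# The three bytes that start the rejected value in both error messages.
raw = b"\xef\xac\x81"

# Interpreted as UTF-8 they form a single character.
char = raw.decode("utf-8")
print(unicodedata.name(char))  # LATIN SMALL LIGATURE FI

# Assumption for illustration: the column uses a single-byte charset
# such as latin1. The ligature cannot be encoded in that charset.
try:
    char.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

print(fits_latin1)  # False
```

If that assumption holds, the error depends on the database character set rather than on the document language chosen in Mayan.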

I assumed it was because the documents were being uploaded with "English" as the document language, so I changed the default document language as follows:


I modified the local.py file under var/lib/docker/volumes/mayan_data/_data/settings and added the following lines:

DOCUMENTS_LANGUAGE_CHOICES = (('deu', 'Deutsch'), ('eng', 'English'), ('spa', 'Spanish'))
DOCUMENTS_LANGUAGE = 'spa'

This worked: the default language when adding a new document is now Spanish, and the list contains only Spanish, English and German.
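
The two settings described above can be written out as a small Python sketch. The setting names and values are the ones from this post; the final check is only an illustrative sanity test and is not part of Mayan.

```python
# Settings as they would appear in local.py (names and values from the post).
DOCUMENTS_LANGUAGE_CHOICES = (
    ('deu', 'Deutsch'),
    ('eng', 'English'),
    ('spa', 'Spanish'),
)
DOCUMENTS_LANGUAGE = 'spa'

# Illustrative check, not part of Mayan: the default language should be
# one of the three-letter codes offered in the choices.
codes = [code for code, _label in DOCUMENTS_LANGUAGE_CHOICES]
assert DOCUMENTS_LANGUAGE in codes, "default language must be one of the choices"
```

Each assignment has to stay on a single line (or inside parentheses), otherwise Python raises a syntax error when it loads the settings file.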

Afterwards, I modified the envfile to install the Spanish Tesseract package:

# MySQL container
MYSQL_ROOT_PASSWORD=********
MYSQL_PASSWORD=*********
MYSQL_DATABASE=mayan_db
MYSQL_USER=mayan_user

# Mayan container
MAYAN_DATABASE_DRIVER=django.db.backends.mysql
MAYAN_DATABASE_NAME=mayan_db
MAYAN_DATABASE_USER=mayan_user
MAYAN_DATABASE_PASSWORD=********
MAYAN_DATABASE_HOST=mayan-mysql
MAYAN_DATABASE_PORT=3306
MAYAN_APT_INSTALLS=libsasl2-dev python-dev libldap2-dev libssl-dev tesseract-ocr-spa
MAYAN_PIP_INSTALLS=python-ldap==2.4.41 django-auth-ldap==1.2.14
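
One way to double-check that the envfile really requests the Spanish language pack is a small parser like the sketch below. The helper `parse_envfile` is hypothetical, written only for this illustration, and the sample text is a shortened copy of the file above; it assumes each variable sits on a single `KEY=value` line.

```python
def parse_envfile(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            env[key.strip()] = value.strip()
    return env


# Shortened sample of the envfile shown in the post.
SAMPLE = """\
# Mayan container
MAYAN_DATABASE_HOST=mayan-mysql
MAYAN_APT_INSTALLS=libsasl2-dev python-dev libldap2-dev libssl-dev tesseract-ocr-spa
"""

env = parse_envfile(SAMPLE)
apt_packages = env.get("MAYAN_APT_INSTALLS", "").split()
print("tesseract-ocr-spa" in apt_packages)
```

To check a real file, the sample string could be replaced with the file's contents, for example `parse_envfile(open("envfile").read())`.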

I assumed this would be enough for OCR to work in Spanish, so I restarted the Docker container and uploaded a document for OCR.

OCR is still not working, and there's no error log under the OCR errors tool.

I checked the docker logs for the mayan-edms container and found this:

Error opening data file /usr/share/tesseract-ocr/tessdata/spa.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'spa'
Tesseract couldn't load any languages!
[2018-05-11 16:55:37,489: ERROR/MainProcess] Task ocr.tasks.task_do_ocr[fb11d940-faaa-4d51-8eb1-a20227ced574] raised unexpected: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/billiard/pool.py", line 1175, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
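
The log above says Tesseract could not open `spa.traineddata`. A quick way to see which language files are actually present is a check like the following sketch. The function name `missing_traineddata` is hypothetical, and the demonstration uses a temporary directory so that it runs anywhere; inside the container the directory to test would be the one named in the log, `/usr/share/tesseract-ocr/tessdata`.

```python
import os
import tempfile


def missing_traineddata(languages, tessdata_dir):
    """Return the language codes whose <code>.traineddata file is absent."""
    return [
        lang for lang in languages
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]


# Demonstration with a throwaway directory that contains only the
# English data file, mimicking a container without tesseract-ocr-spa.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "eng.traineddata"), "w").close()
    missing = missing_traineddata(["eng", "spa"], demo_dir)

print(missing)  # ['spa']
```

If `spa` shows up as missing in the real directory, the language package was not installed, and changing TESSDATA_PREFIX would probably not help.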

Has anyone experienced something similar? I am still looking for a way to set the TESSDATA_PREFIX environment variable, but my experience with Docker is limited.

Any help is appreciated.


--

---
You received this message because you are subscribed to the Google Groups "Mayan EDMS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.
[Mayan EDMS: 2463] Re: Error with OCR in Spanish - Mayan 2.7

Pablo Castro
UPDATE

I was able to get Spanish OCR working by simply deleting the mayan-edms Docker container and running it again; this successfully installed tesseract-ocr-spa.deb.
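
The delete-and-recreate step can be sketched as the sequence of Docker CLI calls below, built as argument lists in Python. The container name `mayan-edms` comes from this thread; the `docker run` arguments (image tag, volume mount point, envfile name) are placeholders and should be taken from the deployment guide instead. A plausible reason a plain restart was not enough is that the image installs the MAYAN_APT_INSTALLS packages only when a container is first created, but the thread does not confirm this.

```python
import shlex


def recreate_commands(container, run_args):
    """Return the docker CLI calls that stop, remove and re-create a container."""
    return [
        ["docker", "stop", container],
        ["docker", "rm", container],
        ["docker", "run", "-d", "--name", container] + list(run_args),
    ]


# Placeholder run arguments: adjust to match the deployment guide.
commands = recreate_commands(
    "mayan-edms",
    [
        "--env-file", "envfile",
        "-v", "mayan_data:/var/lib/mayan",
        "mayanedms/mayanedms:2.7",
    ],
)

# Print the commands for review instead of executing them.
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

`docker rm` does not delete named volumes, so data stored in a named volume such as `mayan_data` is kept when the container is removed.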



On Friday, 11 May 2018 12:37:39 UTC-5, Pablo Castro wrote:
Re: [Mayan EDMS: 2463] Re: Error with OCR in Spanish - Mayan 2.7

rosarior
Administrator
Thanks for the update, Pablo. This will help fix the issue faster.

On Fri, May 11, 2018, 4:18 PM Pablo Castro <[hidden email]> wrote:
UPDATE

I was able to get Spanish OCR working by simply deleting the mayan-edms Docker container and running it again; this successfully installed tesseract-ocr-spa.deb.


