OCR images with tesseract-ocr
OCR stands for optical character recognition, a collection of techniques for recognizing text in images.
Computers can’t “natively” read text from images because:
- text is a stream of characters, where each character is represented with up to four bytes (depending on the character encoding),
- images are (usually compressed) grids of pixel values (colors), without any computer-readable description of what the image depicts.
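The difference between the two representations can be seen in a few lines of Python. This is only an illustrative sketch: the helper names and the tiny 5x4 "image" are made up for this example, and UTF-8 is assumed as the character encoding.

```python
def utf8_byte_count(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Text: every character maps to a short, well-defined byte sequence.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(repr(ch), utf8_byte_count(ch), "byte(s)")

# Image: just a grid of grayscale values (0 = black, 255 = white).
# This hypothetical grid "draws" the letter T, but nothing in the data
# says so; recovering that fact is the job of OCR.
LETTER_T = [
    [0, 0, 0, 0, 0],
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 255],
    [255, 255, 0, 255, 255],
]


def render(pixels):
    """Draw dark pixels as '#' and light pixels as '.' for a quick look."""
    return "\n".join(
        "".join("#" if value < 128 else "." for value in row) for row in pixels
    )


print(render(LETTER_T))
```

A human sees a "T" in the rendered grid immediately, while the program only sees twenty numbers.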
Note
The “hardness” of the OCR problem is exactly the reason why you see all those CAPTCHAs everywhere. They are meant to prevent automated submissions (e.g. thousands of requests per second, generated by a script you made).
Tesseract-OCR, originally developed at Hewlett-Packard and later sponsored by Google, is a free and open-source tool for reading text from images.
Task
Your task is to write a tool on top of tesseract-ocr. Your program (written in Python or Bash) must accept a PDF or image file as input and produce a clean text representation, as accurate as possible, as output.
Your program should use tesseract-ocr for most of the work, so get to know its documentation well.
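One possible starting point is a thin Python wrapper that calls the tesseract command-line program and tidies its output. Treat the following as a sketch, not a reference solution: the function names and the cleanup rules are assumptions made for this example, it expects tesseract to be installed and on your PATH, and it handles a single image only. PDF input would first have to be converted to images with a separate tool (for example pdftoppm from poppler), which is not shown here.

```python
import re
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build the argument list for one tesseract run that prints to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def clean_text(raw):
    """Tidy raw OCR output. These rules are heuristics; adjust them to your data."""
    text = raw.replace("\f", "")                  # drop page-separator form feeds
    # Rejoin words split by a hyphen at a line end. This also merges genuine
    # hyphenated compounds that happen to break there.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # strip spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image file and return the cleaned text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return clean_text(result.stdout)
```

With a sample file at hand you would call, for instance, ocr_image("scan.png", lang="eng"). Keeping the command construction and the cleanup in separate functions makes both easy to test without running tesseract.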
First try to convert the input file into text from the command line. Investigate the tool's language support and image preprocessing capabilities. Image preprocessing (like cropping, sharpening, color levels, …) may significantly improve recognition accuracy.
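To get a feeling for what preprocessing does, here is a minimal pure-Python sketch of three such steps (level stretching, thresholding and cropping) on a tiny grayscale grid. The function names and the sample data are hypothetical; a real tool would more likely use an image library or an external program for this, and the best steps depend on your input.

```python
def stretch_levels(pixels):
    """Rescale values linearly so the darkest becomes 0 and the brightest 255."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:  # flat image, nothing to stretch
        return [row[:] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((value - lo) * scale) for value in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Turn every pixel into pure black (0) or pure white (255)."""
    return [[0 if value < threshold else 255 for value in row] for row in pixels]


def crop(pixels, top, left, bottom, right):
    """Keep rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in pixels[top:bottom]]


# A faded scan: light gray "paper" (200) with two washed-out "ink" pixels.
faded = [
    [200, 200, 200, 200],
    [200, 120, 130, 200],
    [200, 200, 200, 200],
]

naive = binarize(faded)                     # the 130 pixel is lost as white
improved = binarize(stretch_levels(faded))  # both ink pixels survive as black
ink_only = crop(improved, 1, 1, 2, 3)       # cut away the empty border

print(naive[1])
print(improved[1])
print(ink_only)
```

Thresholding the faded scan directly loses one of the two ink pixels, while stretching the levels first keeps both. This is the kind of effect to look for when you experiment with real images.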
Search the web for what others are doing to solve the same problem. Read all basic tutorials about tesseract-ocr.
Ask your instructors for sample program input.