Scan document to text

12/24/2023

During your foray into the world of document scanning, you've probably come across the term "OCR". You may even know that it stands for "Optical Character Recognition". But what is OCR, really, and what do you need to know about it to make the best use of this sophisticated and valuable tool? We're here to give you a run-down on Optical Character Recognition, answer any questions you might have, and recommend the best OCR software for your scanning project.

The primary purpose of Optical Character Recognition is to quickly and automatically convert scanned images of machine-printed (typed) text - which to a computer are no more meaningful a collection of pixels than any other image, such as a landscape photo - into actual text data that you can search through and modify. The exact mechanics of this process are complicated, but suffice it to say that an OCR engine looks at pixel data, searches for patterns resembling letters, numbers, and other symbols, and creates a digitized record of these symbols.

Hey all, this is my first post here, although I have used Gimp for a long time for simple things. I have a problem that I think Gimp could help me solve, but I'm not sure how. I have an old scanned service manual, in very good shape in general, but it would be great if I could bring it to almost perfection. I think there are three things that could improve the current quality from an 80% to a 95% (subjective numbers).

The document has in the background some single pixels (over the white background). Removing all those would be a large improvement.
Use OCR to read the text and replace the somehow "broken" characters with newly rendered ones of the same font and size.
Some pages are not perfectly "vertical", so detecting the pages that are at an angle and rotating them would also be a huge improvement.

If instead of Gimp I should use another tool I'm also open to suggestions. Anyway, another challenge is that the document is 400+ pages long, so whatever I do should be scripted, something like:

Does anyone have an idea on how I could tackle this?

The one you posted is 743 x 921 pixels, which makes that size of text pixelated. I do not think you will have much luck processing 400+ pages.

Do you have a better image?

Quote: The document has in the background some single pixels (over the white background). Removing all those would be a large improvement.

Despeckle with that image does not really work. I fall back on the simple solution of colour select the white background with the select threshold upped to about 50. That excludes the text and includes the speckles.

Quote: Some pages are not perfectly "vertical", so detecting the pages that are at an angle and rotating them would also be a huge improvement.

For that sample page it is more than just straightening. Throw in some guides and use the universal transform tool to line up that left-side edge. (1)

For a decent but large PDF I scaled up to USLetter size (8.5" wide, 300 ppi). That gave an image size of 2550 x 3161 pixels - large image. Sharpened that with a plugin, gmic, which has some good tools. (2)

Gimp makes large PDFs, so exported as a very much reduced quality jpeg (45 quality / chroma halved) and used that with ImageMagick to make that 500 KB PDF that fits this forum max file size.

Quote: Use OCR to read the text and replace the somehow "broken" characters with newly rendered ones of the same font and size.

Since you use linux, tesseract and a front end for it should be available. Well, do you know a matching font? Tesseract is very good but it does depend on the image quality. The whole point is to get text that will run through a spell checker without your intervention. (3) is a comparison between small size image (not so good) and scaled up image (better).

Page-12-98.pdf (attachment)
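For the single background pixels discussed in the thread, one scriptable alternative to the colour-select approach is to whiten any dark pixel that has no dark neighbour. This is a minimal sketch of that idea, not the method used in the reply. It assumes the page is already loaded as a list of rows of grayscale values (0 = black, 255 = white); the function name `remove_isolated_pixels` and the cut-off of 128 are hypothetical choices for illustration.

```python
def remove_isolated_pixels(image, dark_threshold=128):
    """Return a copy of `image` with lone dark pixels set to white (255).

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    A pixel counts as dark when its value is below `dark_threshold`
    (an assumed cut-off; tune it per scan).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched

    def is_dark(y, x):
        # Coordinates outside the page are treated as background.
        return 0 <= y < height and 0 <= x < width and image[y][x] < dark_threshold

    for y in range(height):
        for x in range(width):
            if not is_dark(y, x):
                continue
            # Look at the 8 surrounding pixels for any other dark pixel.
            has_dark_neighbour = any(
                is_dark(y + dy, x + dx)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            if not has_dark_neighbour:
                cleaned[y][x] = 255  # a speckle: paint it white
    return cleaned
```

On a low-resolution scan this can also erase genuine one-pixel marks such as thin punctuation, so it is probably safer to run it after scaling the page up.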
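The scale-up figures and the tesseract suggestion from the thread can be sketched for a batch of pages. The first function reproduces the arithmetic behind the 2550 x 3161 figure (8.5" at 300 ppi, aspect ratio kept). The second only builds one `tesseract` command line per page and runs nothing. The function names, the `*.png` page layout, and the `eng` language default are assumptions for illustration, and the `--dpi` option assumes a Tesseract build that accepts it.

```python
from pathlib import Path


def scaled_size(width_px, height_px, target_width_in=8.5, ppi=300):
    """Pixel size after scaling to a target print width, keeping aspect ratio."""
    new_width = round(target_width_in * ppi)
    new_height = round(height_px * new_width / width_px)
    return new_width, new_height


def tesseract_commands(page_dir, out_dir, lang="eng", dpi=300):
    """Build one `tesseract` command per PNG page (nothing is executed here)."""
    commands = []
    for page in sorted(Path(page_dir).glob("*.png")):
        out_base = Path(out_dir) / page.stem  # tesseract appends ".txt" itself
        commands.append(
            ["tesseract", str(page), str(out_base), "-l", lang, "--dpi", str(dpi)]
        )
    return commands


# The 743 x 921 sample page scaled to US Letter width at 300 ppi:
print(scaled_size(743, 921))  # (2550, 3161)
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the pages have been scaled and cleaned, which keeps the 400+ page run in a single script.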
