Free and easy OCR for C# using OneNote

This post introduces a free and easy way to equip your C# / .NET application with powerful OCR, although the approach is not suitable for production use.

Surprise Pool, Yellowstone National Park – a fitting cover image in the face of all the surprises you will encounter when using OCR

Introduction

“Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text”
Wikipedia on OCR

For a recent personal project, I needed to run OCR on a large number of images. Thus, I started looking for a component with the following features:

  • Programmatic access – either by providing a library or a command line interface
  • Very good recognition rates, even for slightly rotated or skewed text under various lighting conditions
  • Free of charge
  • Reasonable performance

To sum it up, it should be easy to use, yet very powerful and free. Before I started my research, finding a suitable solution seemed very challenging, and that turned out to be true. In the end, however, I am very pleased with the result.

Choosing an OCR engine

I had previously heard of Tesseract, a well-known open-source OCR engine. After some googling, I quickly discovered two promising ways to embed Tesseract into a C# application: Tesseract 3 (OCR) – .NET Wrapper on Stack Overflow and charlesw/tesseract on GitHub.
On the other hand, commercial alternatives like ABBYY FineReader provided much better recognition rates, but – besides not being free – did not allow programmatic access.

Thus, I came up with a quick approach to utilize Microsoft OneNote’s OCR capabilities and invoke them programmatically. Before you continue reading, please note that the solution is far from perfect and should be considered a “hack” rather than a real component. Don’t even think about using something similar in a production environment. In my situation, however, it turned out to be good enough to get the job done.

OneNote is basically a fancy note-taking application similar to Evernote. One of its features is full-text search, which also runs text recognition on images. By right-clicking an image and selecting “Copy text from image”, the recognized text can also be copied to the clipboard.

Usage

Let’s start by taking a look at how to use the component. The class OnenoteOcrEngine provides the core functionality and implements the interface IOcrEngine, which has a single method:

public interface IOcrEngine
{
    string Recognize(Image image);
}

Excluding any error handling, it can be used in a way similar to the following one:

using (var ocrEngine = new OnenoteOcrEngine())
using (var image = Image.FromFile(imagePath))
{
    var text = ocrEngine.Recognize(image);
    if (text == null)
        Console.WriteLine("nothing recognized");
    else
        Console.WriteLine("Recognized: " + text);
}
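
The snippet above leaves out error handling. As a rough sketch (not part of the original component), a small wrapper could catch failures per image so that one bad file does not abort a whole batch. SafeOcr and TryRecognize are hypothetical names, and the interface is repeated only to keep the example self-contained:

```csharp
using System;
using System.Drawing;

// Same shape as the interface shown above; repeated so the sketch compiles on its own.
public interface IOcrEngine
{
    string Recognize(Image image);
}

// Hypothetical helper, not part of the original component.
public static class SafeOcr
{
    // Runs OCR on a single file. Returns true only if text was recognized;
    // any failure (missing file, unreadable image, engine error) is logged
    // and reported as false instead of being thrown.
    public static bool TryRecognize(IOcrEngine engine, string imagePath, out string text)
    {
        text = null;
        try
        {
            using (var image = Image.FromFile(imagePath))
            {
                text = engine.Recognize(image);
            }
            return text != null;
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine("OCR failed for " + imagePath + ": " + ex.Message);
            return false;
        }
    }
}
```

Catching every exception is deliberately coarse here; in a real batch job you would probably distinguish between a missing file and a failure inside the OCR engine.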

Quite simple, isn’t it?

Implementation

The implementation is far less straightforward. Prior to Office 2010, Microsoft Office Document Imaging (MODI) was available for OCR. Unfortunately, this is no longer the case. Further research confirmed that OneNote’s OCR functionality is not directly exposed in the form of an API, but there were suggestions to manually parse OneNote documents for the text (see Is it possible to do OCR on a Tiff image using the OneNote interop API? or need a document to extract text from image using onenote Interop?). And that’s exactly what I did:

  1. Connect to OneNote using COM interop
  2. Create a temporary page containing the image to process
  3. Show the temporary page (important because OneNote won’t perform the OCR otherwise)
  4. Poll for an OCRData tag containing an OCRText tag in the XML code of the page.
  5. Delete the temporary page
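
To illustrate step 4, here is a minimal sketch of how the recognized text could be pulled out of a page’s XML with LINQ to XML. It is not the code from the repository: the class name OneNotePageXml is made up, and it assumes that the page XML has already been fetched as a string and that the OCRData element sits somewhere below the page root in the OneNote 2013 namespace.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class OneNotePageXml
{
    // Namespace of the OneNote 2013 page schema.
    private static readonly XNamespace One =
        "http://schemas.microsoft.com/office/onenote/2013/onenote";

    // Returns the text of the first OCRText element found inside an OCRData
    // element, or null if the page does not contain one (yet).
    public static string ReadOcrText(string pageXml)
    {
        var page = XDocument.Parse(pageXml);
        var ocrText = page
            .Descendants(One + "OCRData")
            .Elements(One + "OCRText")
            .FirstOrDefault();
        return ocrText?.Value;
    }
}
```

Returning null while the element is missing makes it easy to call this repeatedly until OneNote has finished its recognition.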

One challenge was parsing the XML code, for which I decided to use LINQ to XML. For example, inserting the image was done using the following code:

private XElement CreateImageTag(Image image)
{
    var img = new XElement(XName.Get("Image", OneNoteNamespace));
    var data = new XElement(XName.Get("Data", OneNoteNamespace));
    data.Value = this.ToBase64(image);
    img.Add(data);
    return img;
}

private string ToBase64(Image image)
{
    using (var memoryStream = new MemoryStream())
    {
        image.Save(memoryStream, ImageFormat.Png);
        var binary = memoryStream.ToArray();
        return Convert.ToBase64String(binary);
    }
}

Note the usage of XName.Get("Image", OneNoteNamespace) (where OneNoteNamespace is the constant "http://schemas.microsoft.com/office/onenote/2013/onenote") for creating the element with the correct namespace, and the method ToBase64, which serializes a GDI+ image from memory into the Base64 format.
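
As a side note, LINQ to XML also offers the XNamespace type, which allows a slightly more compact way of writing the same thing. The following is only a sketch of that alternative idiom, not the code used in the component; the class name ImageTagSketch is made up, and it takes already-encoded image bytes instead of an Image to stay independent of System.Drawing:

```csharp
using System;
using System.Xml.Linq;

// Hypothetical class for illustration only.
public static class ImageTagSketch
{
    // Namespace of the OneNote 2013 page schema.
    private static readonly XNamespace One =
        "http://schemas.microsoft.com/office/onenote/2013/onenote";

    // Builds an Image element with a nested Data element, both in the
    // OneNote namespace. The Data element holds the bytes as Base64 text.
    public static XElement CreateImageTag(byte[] encodedImage)
    {
        return new XElement(One + "Image",
            new XElement(One + "Data", Convert.ToBase64String(encodedImage)));
    }
}
```

The expression One + "Image" produces the same XName as XName.Get("Image", OneNoteNamespace), so both variants yield identical elements.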

Unfortunately, polling in combination with a timeout (see What is wrong with polling? for a discussion of the topic) is necessary to determine whether the recognition process has completed successfully:

int total = 0;
do
{
    Thread.Sleep(PollInterval);
    this._page.Reload();
    string result = this._page.ReadOcrText();
    if (result != null)
        return result;
} while (total++ < PollAttempts);
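
The same idea can be written as a small reusable helper. This is a sketch under my own assumptions rather than the code from the repository: Polling and PollForResult are hypothetical names, the probe is passed in as a delegate, and, unlike the loop above, the helper checks once before sleeping and throws a TimeoutException when all attempts are used up.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration only.
public static class Polling
{
    // Calls probe up to maxAttempts times, waiting for interval between
    // attempts. Returns the first non-null result; throws TimeoutException
    // if every attempt returned null.
    public static string PollForResult(Func<string> probe, TimeSpan interval, int maxAttempts)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            // No need to wait before the very first check.
            if (attempt > 0)
                Thread.Sleep(interval);

            string result = probe();
            if (result != null)
                return result;
        }

        throw new TimeoutException("No result after " + maxAttempts + " attempts.");
    }
}
```

With the page object from above, a call could look like Polling.PollForResult(() => { page.Reload(); return page.ReadOcrText(); }, TimeSpan.FromMilliseconds(250), 40), where the interval and the attempt count are arbitrary example values.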

Results and final thoughts

Here are a couple of examples when invoking the algorithm against some signs:

Sign5, recognized text:

EXIT 140
SOUTH
Tejon St
Nevada Ave
EXIT *ONLY

Sign4, recognized text:

Beckley Ave
35 NORTH
ro
SOUTH

Sign3, recognized text:

EAST
10
Lake Charles
7th st
“2
M.L. King Pkwy
Magnolia Ave

Sign1, recognized text:

University «
Colorado
Colorado Springs
Colo Tech N
University
NEXT

EnsignPeak, recognized text:

-ENSIGN-PEÄK—
10171 ‘his point, looking northward, one has a clear view of Ensign Peak, a round
up from ‘he Iow range of which it is a par&. On JulyP26, 1847, two days after
Pioneers entere. ‘his valley, Brigham Young and party climbed to point, and
lie1U glasses 171,ade a careful survey of the mountains, canyons and streams. In Xo
Young, -the party included Heber C. Kimball, Wilford Woodruff, George A. Smith,

VVi11ard Richards, Albert Carrington, and William Clayton.
Wilford Woodruff was the first to ascend the peak,
Brigham Young the last, due to a recent illness. It was
suggested that this would be a fitting Place to “set up an
ensign for Ihe nations ” where the Lord “shall assemble the
outcasts oflsrael, and gather together the dispersed of Judah
Trom the Tour corners of the earth”, as foretold in
Isaiah 11:12. It was then named Ensign Peak, and in later
years a standard was erected on its summit.

The results are not perfect. Considering the quality of the images, however, they are more than satisfactory in my opinion. I could successfully use the component in my project.
One very annoying issue remains: sometimes, OneNote crashes during the process. Most of the time, a simple restart fixes this, but trying to recognize text from some images reproducibly crashes OneNote.

Code / Download

Check out the code at GitHub and download the binaries from here.

  • an ho

    Hi, can you give me your source code.

    Please contact for me: hohoangan2014@gmail.com

    Thank you.

  • Dinos Konstantinou

    Interesting approach!
    One minor inaccuracy: ABBYY offers a product for programmatic access (an SDK that is), they call it FineReader Engine.
    It is a pain in the ar** to get it and quite expensive as well…

  • Umamaheswaran Venkatasubramani

    I am getting an exception when tried to execute your code from GIT:

    Exception from HRESULT: 0x80042020

    This is thrown at:

    this._app.GetHierarchy(String.Empty, HierarchyScope.hsPages, out hierarchy);

    on LoadOrCreatePage() method.