The idea

Text extracted from a scanned PDF often contains small glitches: stray spaces and newlines end up all over the place and take a long time to delete. Most of these mistakes can be fixed with find-and-replace, but doing that by hand still takes time. I got bored of finding and replacing paragraph marks with spaces after recognising scanned text, so I made a tool to make life easier. It automates some of the syntactic corrections that scanned text might need. One rule it relies on: if a paragraph doesn't start with a capital letter, it is not a real paragraph, so the break before it is most likely a scanning artifact.

How it works

It uses regular expressions to fix common mistakes. For example, double spaces are replaced with single spaces using a regex in JavaScript. The code can be seen here.
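
To make the two rules mentioned above concrete, here is a minimal sketch of how they might look in JavaScript. It is not the tool's actual code: the function name cleanScannedText and the exact patterns are assumptions made for illustration, and it treats "capital letter" as any Unicode uppercase letter.

```javascript
// Hypothetical helper, written for illustration only (not the tool's real code).
function cleanScannedText(text) {
  return text
    // Rule 1: a line break followed by something that is not an uppercase
    // letter is assumed to be a scanning artifact, so it is replaced by a
    // single space. Whitespace around the break is swallowed as well.
    .replace(/\s*\n+\s*(?=[^\p{Lu}\s])/gu, " ")
    // Rule 2: runs of two or more spaces collapse into a single space.
    .replace(/ {2,}/g, " ");
}

const sample = "This is a  scanned\nline of text.\nNew paragraph here.";
console.log(cleanScannedText(sample));
// Prints:
// This is a scanned line of text.
// New paragraph here.
```

Note that this sketch would also join lines that begin with a digit or a quotation mark, since those are not uppercase letters. A real tool would probably need extra cases for that.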