The idea

Often, when text is extracted from a scanned PDF, small glitches appear: stray spaces and newlines end up all over the place and take a long time to delete. These mistakes can usually be fixed with find-and-replace, but that still takes time. I got bored of finding and replacing paragraph marks with spaces after recognising scanned text, so I made a tool to make life easier. It automates some of the syntactic corrections that scanned text might need. The main rule of thumb: if a paragraph doesn't start with a capital letter, it is not a paragraph.

How it works

It uses regular expressions to fix common mistakes. For example, double spaces are replaced by single spaces using a regex in JavaScript. The code can be seen here.
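To give a feel for the kind of replacements involved, here is a minimal sketch in JavaScript. It is not the tool's actual code: the function name `vacuumText` and the exact patterns are assumptions made for illustration. It collapses repeated spaces and, following the capital-letter rule of thumb, joins a line break onto the previous line when the next line starts with a lowercase letter.

```javascript
// Hypothetical helper: the name and the exact rules are assumptions,
// not the tool's real implementation.
function vacuumText(text) {
  return text
    // Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
    .replace(/\r\n?/g, "\n")
    // Collapse every run of spaces or tabs into a single space.
    .replace(/[ \t]+/g, " ")
    // Drop spaces that sit directly before or after a line break.
    .replace(/ *\n */g, "\n")
    // One or more line breaks followed by a lowercase ASCII letter are
    // treated as a spurious break and replaced with a single space.
    // The lookahead (?=[a-z]) checks the next letter without consuming it.
    .replace(/\n+(?=[a-z])/g, " ")
    // Remove leading and trailing whitespace from the whole text.
    .trim();
}

const scanned = "The  quick brown\nfox jumps.\n\nNext   paragraph.";
console.log(vacuumText(scanned));
```

The sketch is deliberately conservative: it only joins lines that begin with a lowercase ASCII letter, so lines starting with a digit, a quotation mark, or an accented letter keep their break. A real tool would likely need more rules than these.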

Vacuum Text
