by David Cox
To download and install the macro (actual complete macro code) go here. For how to install this macro into MSWord, go here.
(1) Introduction: This page explains a MS Word Macro that I made/am constantly upgrading and remaking. The purpose of this macro is to automatically reformat and "clean up" the text from PDF scans of books. Typically when a book is scanned, and then OCRed, there is a lot of junk, trash, misread letters, etcetera. My context is in cleaning up religious books basically in English, but frequently with Greek, Hebrew, Latin, and even German and French thrown in to really mess up stuff. Many of these books are 300+ pages with some reaching up into 3,000 or 4,000 pages of text. This becomes impossible for a person to handle such things without an automated program to do it for you (hence the MS Word macro).
Let me just say that I doubt this macro will fix all the problems in OCR scanned text. But that is not the point. The point is to reduce the work involved in preparing such text for use is so great, that even if this macro works 95% or 99% of the time, when there are 500 of some error in the text, the macro will reduce the amount of user work greatly. This justifies for me the small percent of the time that it doesn't work correctly. It is important to always, always visually scan the finished text for errors, and perhaps abandon the text the macro has run on and go back to the original copy anyway.
I have searched high and low for information on macros to do this work automatically, and perhaps this can help others in the same situation.
(2) Plan of presentation of this page:
(1) Introduction.
(2) Plan of presentation and overview.
(3) Identify specific elements to be solved as well as problems.
(4) Present the Macro ready for copying to MS Word.
(5) Help in making installing the macro, making it useful, easy to get to, etc.
(3) Identify specific elements to be solved as well as problems: First of all, let me admit that this macro is "imperfect" from the beginning and it will. This macro works on the basis of certain text being in the original in a certain format. If the book printer/typesetter that made the original book did not follow norms and standards, or even if he did but did not follow what I am assuming in the macro, this macro either (1) will not find and make correct changes, or worse, (2) it will corrupt the text.
IMPORTANT: Because of the above, it is always recommended that you follow this procedure: (1) Only work on a copy of your text, not the original. (2) Always scan the finished document to see if there is some kind of gross corruption of the text going on (if so, send me a copy of the parts of the text the macro is goofing up and I will try to understand it, analyze it, and try to fix it.)
DoFindReplace
Before we begin the actual problems and solutions, let's comment on a larger solution. A typical search and replace in Visual Basic for Word takes about 13 lines of code to prepare and execute a search and replace function. This makes a ridiculously large macro, and the larger it is, the harder it is to debug. Therefore I have made a subroutine macro called "DoFindReplace" which will allow me to make a one line of code to set and execute the search text with the replace text, and greatly condense the code behind our work. If you are familiar with programming, then I will pass the search text and replacement text to the subroutine which will "do the grunt work" of the macro.
Sub DoFindReplace(FindText As String, _
ReplaceText As String, _
bMatchWildcards As Boolean, _
bCase As Boolean, _
bWholeWord As Boolean)' Parameters: "text", "text", True/False, True/False, True/False,
'FindText - The text to search for
'ReplaceText - The replacement text
'bMatchWildcards - True/False, true if we are using RegEx expressions
'bCase - True/False, True is match case
'bWholeWord - True/False, True Match only whole words.With Selection.Find
.ClearFormatting
.Replacement.ClearFormatting
' These two reset the formatting for both search text and
' replacement text as a safety measure
.Text = FindText
.Replacement.Text = ReplaceText
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = bCase
.MatchWholeWord = bWholeWord
.MatchWildcards = bMatchWildcards
.MatchSoundsLike = False
.MatchAllWordForms = False
Do While .Execute
'Keep going until nothing found
.Execute Replace:=wdReplaceAll
Loop
'Free up some memory
ActiveDocument.UndoClear
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
End With
End Sub
Basically this is a subroutine to do the repetitive stuff in one set of code, without having to grow our macros excessively.
Specific Problems in OCR Scanned Texts
A. Our first headache is the addition of spaces around punctuation.
What this problem is simply is that the OCR process sometimes makes text with extra spaces, like this.
comprehensive survey of a great properly speaking
,¶
an illimitable subject. Greater
( completeness
) in the ¶
study of this topic is indeed most desirable
, but ¶
perhaps completeness of plan and systematic outline ¶
are not the chief requisites in an attempt to describe ¶
the influence upon the human spirit of that Divine ¶
So our first step is to clean this up, eliminating these extra spaces. Our key here is that the text already has an identifying mark, a punctuation mark with extra spaces around it.
Call DoFindReplace(" : ", ": ", False, False, False)
Call DoFindReplace(" ; ", "; ", False, False, False)
Call DoFindReplace(" , ", ", ", False, False, False)
Call DoFindReplace(" ? ", "? ", False, False, False)
Call DoFindReplace(" ! ", "! ", False, False, False)
Call DoFindReplace(" ( ", " (", False, False, False)
Call DoFindReplace(" ) ", ") ", False, False, False)
StatusBar = "replacing punctuation + extra space"
B. Our next problem is that the OCR process "eats" the apostrophes sometimes.
Like "the man s coat" or "don t do that". A lot of these things we will simply never be able to automatically replace because in this example "the cats' heads", the apostrophe is difficult to discern. Anybody got ideas on how to reliably identify this text, I will try to implement code for that. Our problem here is that we need to make this code look for
StatusBar = "finding stray Apostrophes before s and t"
' Before running the code below, we need to shorten up spaces before
' periods, because of outlining, like A . B . etc. S .
Call DoFindReplace(" .", " .", False, True, False)
Call DoFindReplace(" .", " .", False, True, False)
Call DoFindReplace(" .", " .", False, True, False)
'Exception: If there is an outline that reaches s and t, and it has
' no period after the letter, this will corrupt text.
Call DoFindReplace(" s ", "'s ", False, True, False)
Call DoFindReplace(" S ", "'S ", False, True, False)
Call DoFindReplace(" t ", "'t ", False, True, False)
Call DoFindReplace(" T ", "'T ", False, True, False)
Please note the Exception which explains the condition in which this code WILL CORRUPT THE TEXT.
C. Our next is our biggest headache which is the line breaks at the end of each line.
Simply put, this is the biggest headache of working with OCR scanned imported text. So these sections of code will do the "hard stuff."
Sample of Problem
comprehensive survey of a great properly speaking,¶
an illimitable subject. Greater completeness in the ¶
study of this topic is indeed most desirable, but ¶
perhaps completeness of plan and systematic outline ¶
are not the chief requisites in an attempt to describe ¶
the influence upon the human spirit of that Divine ¶
Breath which bloweth where it listeth, and whose ¶
chief characteristic it is to surpass human thought ¶
Here the problem is in eliminating the extra line breaks but not getting rid of all paragraph marks. In general a double paragraph will indicate a "real paragraph break", but also there are areas of 2, 3, 4, 5, up to 10 paragraphs one after another that messes up our strategy. So the more we can leave the text as it originally is, the better. So my strategy here is to search and replace based on certain conditions of the letters before and after the paragraph mark.
RULE: I assume that paragraph final punctuation is basically only a period ".", a question mark "?", and an exclamation mark "!". This may not be true always, but it is a good general "rule of thumb." For an example of when it is not true, this happens when a scanned text hits a page break, and not all lines end with final punctuation.
in a regular, unbroken round of prosperous,
com ¶
fortable existence. The wine that settles on its lees ¶
N 2 ¶
¶
180 THE TIDES OF THE SPIRIT ¶
and is not emptied from vessel to vessel preserves its ¶
original taste and flavour unweakened. This process¶
Here we see that a page break has happened, and I have no idea what "N2" means, but there is a header or footer with the page number, and the text continues. This just will screw us up, BUT normally there are extra paragraph marks around the header which will put some space around it and make it obvious what has happened. At one time I wanted to remove these headers/footers but decided against it because in my scanned texts they are notoriously corrupted, never getting the book title or the chapter title right, and to be constantly guessing and changing the header/footer information makes removing it next to impossible. Besides that, the paging of the original work is necessary for those books that have Scripture Indexes and such, and most all books have a table of contents that becomes useless if we remove all page numbering from the text.
Our strategy here is going to selectively replace paragraph marks for the text based on whether there is a final punctuation mark before the paragraph mark. Before we start this, let's make some remarks on the second greatest problem of OCR scanned texts. Notice the white background text in the sample text above. Here the original printer has divided a word "comfort" but has not done us the favor of putting in a hyphen to mark that separation. This makes our finished text to have errors like "prosperous, com fortable existence." At presence I have no way of "fixing" this problem, so we will try a series of searches and replaces based on common prefixes and suffixes, and try to fix a good number of these types of errors. Note that we will not fix everything, but we will fix a lot, and at least that will help us on the fixing these things by hand after the macro has run.
Basically we are going to divide this into two parts, prefixes and suffixes. Our rule is that the things we are looking for are not correct English by themselves. For example, " com " is not a word in English. One common prefix which we have to "not touch" (i.e. not look for nor replace) is " in " because it is a valid word by itself. So what I did is not really a complete fix, but it will get a lot of errors out of the way for us. Basically I looked for common prefixes and suffixes in English, and when we encounter them alone (separated before and after with commas) we replace them without the trailing space. This is caused by book printers who separate these words, and when scanned and OCRed, makes the error.
I am not going to copy all the macro code here, just a few to explain them. (The entire macro code is at the bottom of this page.)
StatusBar = "Replacing special known separated words, like con stant"
'These replace broken words based on prefix
' Prefixes which are not replaced a, in, out, over, post, be
' contra, counter, extra
Call DoFindReplace(" ad ^p", " ad", False, True, False)
Call DoFindReplace(" ad ^p", " ad", False, True, False)
Call DoFindReplace(" ac ^p", " ac", False, True, False)
Call DoFindReplace(" anti ^p", " anti", False, True, False)
etcetera
What this code does is look for these prefixes which should be part of a word, but they are sitting alone, surrounded front by a space, and back with a space paragraph mark (^p is paragraph mark in Word Macro language and search and replace dialogs). We replace them getting rid of the extra space, but we also get rid of the paragraph mark while we are at it.
EXCEPTION: Note that if your text does not have a space at the end of each paragraph, then you might need to search and replace for the paragraph mark before running the macro.
Search Box Text: "^p"
Replace Box Text: " ^p" (space paragraph mark)
'broken words based on suffix
' Suffixes which are not replaced able, er,
' The following code assumes there is not a space before the first word.
Call DoFindReplace("^pance ", "ance ", False, True, False)
Call DoFindReplace("^pally ", "ally ", False, True, False)
Call DoFindReplace("^pality ", "ality ", False, True, False)
etcetera
What this code does is look for suffixes that are immediately after a paragraph mark.
Now comes the fun stuff. First of all, we don't want to collapse our original spacing if possible. So we will first replace all double paragraphs with a placeholder (I am using <para> here but anything could be used in its place). At the end of the macro this will be replaced with paragraph marks. The reason for this is to preserve at least 1 double paragraph.
'Search for Spacing paragraphs.
Call DoFindReplace("^p^p", "<para>", False, False, False)
Our code will search for paragraph marks that have a lower case letter or number before a paragraph with no punctuation marks.
StatusBar = "getting rid of paragraph marks at not sentence final positions"
'This will try to fix line breaks but not at sentence endpoints.
Call DoFindReplace("0 ^p", "0 ", False, False, False)
Call DoFindReplace("1 ^p", "1 ", False, False, False)
Call DoFindReplace("2 ^p", "2 ", False, False, False)
Call DoFindReplace("3 ^p", "3 ", False, False, False)
Call DoFindReplace("4 ^p", "4 ", False, False, False)
Call DoFindReplace("5 ^p", "5 ", False, False, False)
Call DoFindReplace("6 ^p", "6 ", False, False, False)
Call DoFindReplace("7 ^p", "7 ", False, False, False)
Call DoFindReplace("8 ^p", "8 ", False, False, False)
Call DoFindReplace("9 ^p", "9 ", False, False, False)
Call DoFindReplace("a ^p", "a ", False, True, False)
Call DoFindReplace("b ^p", "b ", False, True, False)
Call DoFindReplace("c ^p", "c ", False, True, False)
Call DoFindReplace("d ^p", "d ", False, True, False) etc.
This section of code is repeated twice, and the second time for paragraphs that have a space before the paragraph mark.
Now we will replace lower case letters beginning sentences after a question mark
'This will correct this error xaxadx axax? lowercase
Call DoFindReplace("? a", "? A", False, True, False)
Call DoFindReplace("? b", "? B", False, True, False)
Call DoFindReplace("? c", "? C", False, True, False)
Call DoFindReplace("? d", "? D", False, True, False)
Call DoFindReplace("? e", "? E", False, True, False) etc
In the above code, there will be some situations that will not have the line feed removed, notably when the line ends with either of these (,;-)
'Lines that end in a comma or semicolon, e.g.
Call DoFindReplace(",^p", ", ", False, False, False)
Call DoFindReplace(";^p", "; ", False, False, False)
Call DoFindReplace("- ^p", "-", False, False, False)
Now we finish up with getting rid of extra spacing anywhere in the text, and replacing our double paragraphs.
StatusBar = " Returning <para> to paragraphs"
'Replace Spacing paragraphs.
Call DoFindReplace("<para>", "^p^p", False, False, False)
StatusBar = "Getting rid of extra spaces"
'Remove extra spaces
Call DoFindReplace(" ", " ", False, False, False)
Call DoFindReplace(" ", " ", False, False, False)
Call DoFindReplace(" ", " ", False, False, False)
Call DoFindReplace(" ", " ", False, False, False)
StatusBar = "Macro Finished (created by David Cox davidcox.com.mx/eswordmodules/macro_fixbrokenlines.htm)"
(4) Present the Macro. Copy and paste this macro code into your Word Macros. Normally it is Tools, Macros, Visual Basic.
Note that this macro is actually two macros, CleanText, and DoFindReplace. You must have both for it to work.
Sub cleantext()
'
' cleantext Macro
' Macro created by David R. Cox on March 28, 2008
' Purpose: Used for cleaning up text imported from OCR type original sources,
' or PDF copy and paste where there are line breaks at the end physical lines, but
' are not at the end of a sentence.
' This macro needs the subroutine of DoFindReplace which makes a long list of substitutions more readable.
'Quotation Marks Alt 0147 Beg, Alt 0148 end ["|“]*["|”]
'["|“] [A-z]{3,25} ["|”|“]
'This will clear the formatting left over from a previous macro.
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
'Common Formatting problems
StatusBar = "finding stray Apostrophes before s and t"
'Exception: If there is an outline that reaches s and t, and it has
' no period after the letter, this will corrupt text.
Call DoFindReplace(" s ", "'s ", False, True, False)
Call DoFindReplace(" S ", "'S ", False, True, False)
'Call DoFindReplace(" t ", "'t ", False, True, False)
Call DoFindReplace(" T ", "'T ", False, True, False)
'Remove space after paragraph mark
'Call DoFindReplace("^p ", "^p", False, False, False)
'Call DoFindReplace("^p", " ^p", False, False, False)
'Call DoFindReplace(" ^p", " ^p", False, False, False)
' This loops endlessly
StatusBar = "replacing punctuation + extra space"
'Here we add a space before punctuation because examples such as
'common alities? or common alities" will not be found otherwise.
'At the end we will remove white space before puntuction.
'Preliminary problems that will mess up stuff later.
Call DoFindReplace("St. ^p", "St. ", False, True, False)
Call DoFindReplace("Mr. ^p", "Mr. ", False, True, False)
Call DoFindReplace("Mrs. ^p", "Mrs. ", False, True, False)
Call DoFindReplace("Dr. ^p", "Dr. ", False, True, False)
Call DoFindReplace("Drs. ^p", "Drs. ", False, True, False)
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = ")"
.Replacement.Text = " )"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = "!"
.Replacement.Text = " !"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = "?"
.Replacement.Text = " ?"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = ","
.Replacement.Text = " ,"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = ";"
.Replacement.Text = " ;"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = ":"
.Replacement.Text = " :"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = "."
.Replacement.Text = " ."
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
StatusBar = "Replacing special known separated words, like con stant"
'These replace broken words based on prefix
' Prefixes which are not replaced a, in, out, over, post, be
' contra, counter, extra
Call DoFindReplace(" ad ^p", " ad", False, True, False)
Call DoFindReplace(" ad ^p", " ad", False, True, False)
Call DoFindReplace(" ac ^p", " ac", False, True, False)
Call DoFindReplace(" anti ^p", " anti", False, True, False)
Call DoFindReplace(" ante ^p", " ante", False, True, False)
Call DoFindReplace(" asm ^p", " asm", False, True, False)
Call DoFindReplace(" ary ^p", " ary", False, True, False)
Call DoFindReplace(" ated ^p", " ated", False, True, False)
Call DoFindReplace(" atry ^p", " atry", False, True, False)
Call DoFindReplace(" bi ^p", " bi", False, True, False)
Call DoFindReplace(" bene ^p", " bene", False, True, False)
Call DoFindReplace(" cent ^p", " cent", False, True, False)
Call DoFindReplace(" centi ^p", " centi", False, True, False)
Call DoFindReplace(" circum ^p", " circum", False, True, False)
Call DoFindReplace(" co ^p", " co", False, True, False)
Call DoFindReplace(" col ^p", " col", False, True, False)
Call DoFindReplace(" com ^p", " com", False, True, False)
Call DoFindReplace(" Com ^p", " Com", False, True, False)
Call DoFindReplace(" con ^p", " con", False, True, False)
Call DoFindReplace(" counter ^p", " counter", False, True, False)
Call DoFindReplace(" de ^p", " de", False, True, False)
Call DoFindReplace(" des ^p", " des", False, True, False)
Call DoFindReplace(" dis ^p", " dis", False, True, False)
Call DoFindReplace(" dia ^p", " dia", False, True, False)
Call DoFindReplace(" dys ^p", " dys", False, True, False)
Call DoFindReplace(" ex ^p", " ex", False, True, False)
Call DoFindReplace(" en ^p", " en", False, True, False)
Call DoFindReplace(" em ^p", " em", False, True, False)
Call DoFindReplace(" fore ^p", " fore", False, True, False)
Call DoFindReplace(" fre ^p", " fre", False, True, False)
Call DoFindReplace(" hypo ^p", " hypo", False, True, False)
Call DoFindReplace(" hyper ^p", " hyper", False, True, False)
Call DoFindReplace(" inter ^p", " inter", False, True, False)
Call DoFindReplace(" infra ^p", " infra", False, True, False)
Call DoFindReplace(" intra ^p", " intra", False, True, False)
Call DoFindReplace(" im ^p", " im", False, True, False)
Call DoFindReplace(" il ^p", " il", False, True, False)
Call DoFindReplace(" ir ^p", " ir", False, True, False)
Call DoFindReplace(" ized ^p", " ized", False, True, False)
Call DoFindReplace(" ob ^p", " ob", False, True, False)
Call DoFindReplace(" oc ^p", " oc", False, True, False)
Call DoFindReplace(" multi ^p", " multi", False, True, False)
Call DoFindReplace(" mes ^p", " mes", False, True, False)
Call DoFindReplace(" mani ^p", " mani", False, True, False)
Call DoFindReplace(" peri ^p", " peri", False, True, False)
Call DoFindReplace(" por ^p", " por", False, True, False)
Call DoFindReplace(" par ^p", " par", False, True, False)
Call DoFindReplace(" para ^p", " para", False, True, False)
Call DoFindReplace(" pro ^p", " pro", False, True, False)
Call DoFindReplace(" pre ^p", " pre", False, True, False)
Call DoFindReplace(" per ^p", " per", False, True, False)
Call DoFindReplace(" pur ^p", " pur", False, True, False)
Call DoFindReplace(" mis ^p", " mis", False, True, False)
Call DoFindReplace(" re ^p", " re", False, True, False)
Call DoFindReplace(" semi ^p", " semi", False, True, False)
Call DoFindReplace(" sub ^p", " sub", False, True, False)
Call DoFindReplace(" sus ^p", " sus", False, True, False)
Call DoFindReplace(" syn ^p", " syn", False, True, False)
Call DoFindReplace(" tele ^p", " tele", False, True, False)
Call DoFindReplace(" trans ^p", " trans", False, True, False)
Call DoFindReplace(" tran ^p", " tran", False, True, False)
Call DoFindReplace(" tri ^p", " tri", False, True, False)
Call DoFindReplace(" ultra ^p", " ultra", False, True, False)
Call DoFindReplace(" uni ^p", " uni", False, True, False)
Call DoFindReplace(" un ^p", " un", False, True, False)
Call DoFindReplace(" unim ^p", " unim", False, True, False)
'broken words based on suffix
' Suffixes which are not replaced able, er,
' The following code assumes there is not a space before the first word.
' ERRORS: These will not replace things with puntuation or parentheses
' like ings, or ings,"
Call DoFindReplace(" ^pance ", "ance ", False, True, False)
Call DoFindReplace(" ^pally ", "ally ", False, True, False)
Call DoFindReplace(" ^pality ", "ality ", False, True, False)
Call DoFindReplace(" ^pbility ", "bility ", False, True, False)
Call DoFindReplace(" ^pbilities ", "bilities ", False, True, False)
Call DoFindReplace(" ^pastic ", "astic ", False, True, False)
Call DoFindReplace(" ^pation ", "ation ", False, True, False)
Call DoFindReplace(" ^pated ", "ated ", False, True, False)
Call DoFindReplace(" ^pations ", "ations ", False, True, False)
Call DoFindReplace(" ^pature ", "ature ", False, True, False)
Call DoFindReplace(" ^pcally ", "cally ", False, True, False)
Call DoFindReplace(" ^pcance ", "cance ", False, True, False)
Call DoFindReplace(" ^pcated ", "cated ", False, True, False)
Call DoFindReplace(" ^pciate ", "ciate ", False, True, False)
Call DoFindReplace(" ^pcism ", "cism ", False, True, False)
Call DoFindReplace(" ^pceed ", "ceed ", False, True, False)
Call DoFindReplace(" ^pdard ", "dard ", False, True, False)
Call DoFindReplace(" ^pdom ", "dom ", False, True, False)
Call DoFindReplace(" ^pestness ", "estness ", False, True, False)
Call DoFindReplace(" ^pen ", "en ", False, True, False)
Call DoFindReplace(" ^pers ", "ers ", False, True, False)
Call DoFindReplace(" ^pence", "ence ", False, True, False)
Call DoFindReplace(" ^pences ", "ences ", False, True, False)
Call DoFindReplace(" ^pered ", "ered ", False, True, False)
Call DoFindReplace(" ^pfied ", "fied ", False, True, False)
Call DoFindReplace(" ^pful ", "ful ", False, True, False)
Call DoFindReplace(" ^pible ", "ible ", False, True, False)
Call DoFindReplace(" ^piarly ", "iarly ", False, True, False)
Call DoFindReplace(" ^pify ", "ify ", False, True, False)
Call DoFindReplace(" ^ping ", "ing ", False, True, False)
Call DoFindReplace(" ^pings ", "ings", False, True, False)
Call DoFindReplace(" ^pism ", "ism ", False, True, False)
Call DoFindReplace(" ^pisms ", "isms ", False, True, False)
Call DoFindReplace(" ^pist ", "ist ", False, True, False)
Call DoFindReplace(" ^pistic ", "istic ", False, True, False)
Call DoFindReplace(" ^pists ", "ists ", False, True, False)
Call DoFindReplace(" ^pise ", "ise ", False, True, False)
Call DoFindReplace(" ^pity ", "ity ", False, True, False)
Call DoFindReplace(" ^pities ", "ities ", False, True, False)
Call DoFindReplace(" ^pize ", "ize ", False, True, False)
Call DoFindReplace(" ^plated ", "lated ", False, True, False)
Call DoFindReplace(" ^plieve ", "lieve ", False, True, False)
Call DoFindReplace(" ^plute ", "lute ", False, True, False)
Call DoFindReplace(" ^pgion ", "gion ", False, True, False)
Call DoFindReplace(" ^pmani ", "mani ", False, True, False)
Call DoFindReplace(" ^pment ", "ment ", False, True, False)
Call DoFindReplace(" ^pmenon ", "menon ", False, True, False)
Call DoFindReplace(" ^pmena ", "mena ", False, True, False)
Call DoFindReplace(" ^pments ", "ments ", False, True, False)
Call DoFindReplace(" ^pmon ", "mon ", False, True, False)
Call DoFindReplace(" ^pmons ", "mons ", False, True, False)
' Note: mons is a Latin word and could be falsely conjoined
Call DoFindReplace(" ^pnant ", "nant ", False, True, False)
Call DoFindReplace(" ^pnent ", "nent ", False, True, False)
Call DoFindReplace(" ^pness ", "ness ", False, True, False)
Call DoFindReplace(" ^pneous ", "neous ", False, True, False)
Call DoFindReplace(" ^pnity ", "nity ", False, True, False)
Call DoFindReplace(" ^pnize ", "nize ", False, True, False)
Call DoFindReplace(" ^pnizes ", "nizes ", False, True, False)
Call DoFindReplace(" ^pning ", "ning ", False, True, False)
Call DoFindReplace(" ^prity ", "rity ", False, True, False)
Call DoFindReplace(" ^pously ", "ously ", False, True, False)
Call DoFindReplace(" ^pous ", "ous ", False, True, False)
Call DoFindReplace(" ^psary ", "sary ", False, True, False)
Call DoFindReplace(" ^psarily ", "sarily ", False, True, False)
Call DoFindReplace(" ^psant ", "sant ", False, True, False)
Call DoFindReplace(" ^psible ", "sible ", False, True, False)
Call DoFindReplace(" ^psibly ", "sibly ", False, True, False)
Call DoFindReplace(" ^psion ", "sion ", False, True, False)
Call DoFindReplace(" ^psions ", "sions ", False, True, False)
Call DoFindReplace(" ^psive ", "sive ", False, True, False)
Call DoFindReplace(" ^ptery ", "tery ", False, True, False)
Call DoFindReplace(" ^ptary ", "tary ", False, True, False)
Call DoFindReplace(" ^ptative ", "tative ", False, True, False)
Call DoFindReplace(" ^ptice ", "tice ", False, True, False)
Call DoFindReplace(" ^ptical ", "tical ", False, True, False)
Call DoFindReplace(" ^ptain ", "tain ", False, True, False)
Call DoFindReplace(" ^ptaining ", "taining ", False, True, False)
Call DoFindReplace(" ^ptained ", "tained ", False, True, False)
Call DoFindReplace(" ^ptains ", "tains ", False, True, False)
Call DoFindReplace(" ^ptant ", "tant ", False, True, False)
Call DoFindReplace(" ^ptent ", "tent ", False, True, False)
Call DoFindReplace(" ^pteer ", "teer ", False, True, False)
Call DoFindReplace(" ^pteers ", "teers ", False, True, False)
Call DoFindReplace(" ^ptian ", "tian ", False, True, False)
Call DoFindReplace(" ^pties ", "ties ", False, True, False)
Call DoFindReplace(" ^ptine ", "tine ", False, True, False)
Call DoFindReplace(" ^ption ", "tion ", False, True, False)
Call DoFindReplace(" ^ptions ", "tions ", False, True, False)
Call DoFindReplace(" ^ptive ", "tive ", False, True, False)
Call DoFindReplace(" ^ptize ", "tize ", False, True, False)
Call DoFindReplace(" ^ptizes ", "tizes ", False, True, False)
Call DoFindReplace(" ^ptized ", "tized ", False, True, False)
Call DoFindReplace(" ^ptively ", "tively ", False, True, False)
Call DoFindReplace(" ^ptual ", "tual ", False, True, False)
Call DoFindReplace(" ^ptude ", "tude ", False, True, False)
Call DoFindReplace(" ^ptudes ", "tudes ", False, True, False)
Call DoFindReplace(" ^pture ", "ture ", False, True, False)
Call DoFindReplace(" ^ptures ", "tures ", False, True, False)
Call DoFindReplace(" ^pulate ", "ulate ", False, True, False)
Call DoFindReplace(" ^pully ", "ully ", False, True, False)
Call DoFindReplace(" ^pual ", "ual ", False, True, False)
Call DoFindReplace(" ^puable ", "uable ", False, True, False)
Call DoFindReplace(" ^pume ", "ume ", False, True, False)
Call DoFindReplace(" ^pvour ", "vour ", False, True, False)
'Search for paragraphs with ending punctuation before it.
Call DoFindReplace(".^p", ".<para>", False, False, False)
Call DoFindReplace("?^p", "?<para>", False, False, False)
Call DoFindReplace("!^p", "!<para>", False, False, False)
Call DoFindReplace(".)^p", ".)<para>", False, False, False)
Call DoFindReplace("?)^p", "?)<para>", False, False, False)
Call DoFindReplace("!)^p", "!)<para>", False, False, False)
StatusBar = "Saving sentence final paragraphs to <para>"
'Search for Spacing paragraphs.
Call DoFindReplace("^p^p", "<para>", False, False, False)
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = ". ^p"
.Replacement.Text = ". ^p^p"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = "? ^p"
.Replacement.Text = "? ^p^p"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
Selection.HomeKey unit:=wdStory
With Selection.Find
.Text = "! ^p"
.Replacement.Text = "! ^p^p"
.Forward = True
.Wrap = wdFindContinue
.MatchCase = False
.MatchWildcards = False
End With
Selection.Find.Execute
StatusBar = "getting rid of paragraph marks at not sentence final positions"
'This will try to fix line breaks but not at sentence endpoints.
Call DoFindReplace("([0-9a-z]{1}) ^13", "\1 ", True, False, False)
StatusBar = "Replacing based on first letter lower case of paragraph"
'This will try to fix line breaks but not at sentence endpoints.
Call DoFindReplace("^13[0-9a-z]{1})", " \1", False, True, False)
'This will correct this error xaxadx axax? lowercase
Call DoFindReplace("? ([0-9a-z]{1})", "? \1", False, True, False)
Call DoFindReplace(" I ^p", " I ", False, True, False)
Call DoFindReplace(" A ^p", " A ", False, True, False)
Call DoFindReplace(" i ", " I ", False, True, False)
'Lines that end in a comma or semicolon, e.g.
Call DoFindReplace(",^p", ", ", False, False, False)
Call DoFindReplace(", ^p", ", ", False, False, False)
Call DoFindReplace(";^p", "; ", False, False, False)
Call DoFindReplace("; ^p", "; ", False, False, False)
Call DoFindReplace("- ^p", "-", False, False, False)
Call DoFindReplace("-^p", "-", False, False, False)
' General OCR Scanning errors misread letters
Call DoFindReplace("Tbe ", "The ", False, True, False)
Call DoFindReplace(" \h", " W", False, True, False)
Call DoFindReplace("Wliat", "What", False, True, False)
Call DoFindReplace(" \v", " w", False, True, False)
Call DoFindReplace(" r ", "", False, True, False)
Call DoFindReplace(" ot ", " of ", False, True, False)
StatusBar = " Returning <para> to paragraphs"
'Replace Spacing paragraphs.
Call DoFindReplace("<para>", "^p^p", False, False, False)
StatusBar = "replacing punctuation + extra space"
Call DoFindReplace(" :", ":", False, False, False)
Call DoFindReplace(" ;", ";", False, False, False)
Call DoFindReplace(" ,", ",", False, False, False)
Call DoFindReplace(" ?", "?", False, False, False)
Call DoFindReplace(" !", "!", False, False, False)
Call DoFindReplace("( ", "(", False, False, False)
Call DoFindReplace(" )", ")", False, False, False)
' Before running the code below, we need to shorten up spaces before
' periods, because of outlining, like A . B . etc. S .
Call DoFindReplace("([A-Z]{1,2}) .", "\1.", True, False, False)
Call DoFindReplace("([A-Z]{1,2}) .", "\1.", True, False, False)
Call DoFindReplace("([A-Z]{1,2}). ", "\1. ", True, False, False)
Call DoFindReplace("^p^p^p", "^p^p", False, False, False)
'Fixing one paragraph after final punctuation
Call DoFindReplace(". ^13([0-9A-z]{1})", ". ^p^p\1", True, False, False)
Call DoFindReplace("\? ^13([0-9A-z]{1})", "? ^p^p\1", True, False, False)
Call DoFindReplace("!", "<exclam>", False, False, False)
Call DoFindReplace("<exclam> ^p", "<exclam>^p^p", False, False, False)
Call DoFindReplace("<exclam>", "!", False, False, False)
Call DoFindReplace("?", "<ques>", False, False, False)
Call DoFindReplace("<ques>"" ^p", "?"" ^p^p", False, False, False)
Call DoFindReplace("<ques>", "?", False, False, False)
Call DoFindReplace(".", "<perido>", False, False, False)
Call DoFindReplace("<perido>"" ^p", "."" ^p^p", False, False, False)
Call DoFindReplace("<perido>", ".", False, False, False)
Call DoFindReplace("""", "<quot&>", False, False, False)
Call DoFindReplace(" <quot&> ^p", " <quot&>", False, False, False)
Call DoFindReplace("<quot&>", """", False, False, False)
StatusBar = "Getting rid of extra spaces"
'Remove extra spaces
Call DoFindReplace(" ", " ", False, False, False)
Call DoFindReplace(" ", " ", False, False, False)
Call DoFindReplace(" ", " ", False, False, False)
Call DoFindReplace(" ", " ", False, False, False)
'Condensing ... and ,,,
Call DoFindReplace(" .", ".", False, False, False)
Call DoFindReplace(" .", ".", False, False, False)
Call DoFindReplace("....", "...", False, False, False)
Call DoFindReplace("....", "...", False, False, False)
Call DoFindReplace("....", "...", False, False, False)
Call DoFindReplace("....", "...", False, False, False)
Call DoFindReplace("....", "...", False, False, False)
Call DoFindReplace(" ,", ",", False, False, False)
Call DoFindReplace(" ,", ",", False, False, False)
Call DoFindReplace(",,", ",", False, False, False)
Call DoFindReplace(",,", ",", False, False, False)
Call DoFindReplace(",,", ",", False, False, False)
'This next section makes sure all paragraphs have no spacing.
Selection.WholeStory
With Selection.ParagraphFormat
.LeftIndent = CentimetersToPoints(0)
.RightIndent = CentimetersToPoints(0)
.SpaceBefore = 0
.SpaceBeforeAuto = False
.SpaceAfter = 0
.SpaceAfterAuto = False
.LineSpacingRule = wdLineSpaceSingle
.Alignment = wdAlignParagraphLeft
.WidowControl = True
.KeepWithNext = False
.KeepTogether = False
.PageBreakBefore = False
.NoLineNumber = False
.Hyphenation = True
.FirstLineIndent = CentimetersToPoints(0)
.OutlineLevel = wdOutlineLevelBodyText
.CharacterUnitLeftIndent = 0
.CharacterUnitRightIndent = 0
.CharacterUnitFirstLineIndent = 0
.LineUnitBefore = 0
.LineUnitAfter = 0
End With
Selection.HomeKey unit:=wdStory
StatusBar = "Clean Text Macro Finished (created by David Cox davidcox.com.mx/eswordmodules/macro_cleantext.htm)"
End SubSub DoFindReplace(FindText As String, _
ReplaceText As String, _
bMatchWildcards As Boolean, _
bCase As Boolean, _
bWholeWord As Boolean)' Parameters: "text", "text", True/False, True/False, True/False,
'FindText - The text to search for
'ReplaceText - The replacement text
'bMatchWildcards - True/False, true if we are using RegEx expressions
'bCase - True/False, True is match case
'bWholeWord - True/False, True Match only whole words.With Selection.Find
.ClearFormatting
.Replacement.ClearFormatting
' These two reset the formatting for both search text and
' replacement text as a safety measure
.Text = FindText
.Replacement.Text = ReplaceText
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = bCase
.MatchWholeWord = bWholeWord
.MatchWildcards = bMatchWildcards
.MatchSoundsLike = False
.MatchAllWordForms = False
Do While .Execute
'Keep going until nothing found
.Execute Replace:=wdReplaceAll
Loop
'Free up some memory
ActiveDocument.UndoClear
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
End With
End Sub
(5) Help in making installing the macro, making it useful, easy to get to, etc.
visits since July 14, 2009