====== Extraction Wizard ======

The Extraction Wizard allows you to create Extraction Schemes for capturing data from a source text into the columns of a Database. You can capture data from emails, cloud systems, PDF files, or any other kind of structured data. You can define conditions on line, sub-line, word, and/or symbol level for each Database column while the extraction results are shown on-the-fly in the **Test Result** field.

The available options cover vastly more than you will need in typical usage. In most cases, you will need only 2 or 3 controls for each Database column. In special, complex situations, you can activate the [[#Use Regex for text fields]] checkbox to inject regular expressions into your settings, which opens up many further possibilities.

To begin, paste a test string – e.g. a purchase order from which you want to extract information – into the large **Test String** field on the right side and experiment with the options. Always pick the [[#Data Source]] first (unless you only want to write a [[#Fixed Text]] into that Database column), and then trim down the text to be extracted by applying limitations on line, sub-line, word, and/or character level until the result matches the content you want to store in a Database. To undo a selection in any of the listboxes, simply double-click it.

You'll probably get how it works without reading the manual, although you might miss out on some less obvious functionalities and tricks. A detailed description of each control follows below. Alternatively, skip directly to the [[#Examples]] section to gain intuitive understanding.

To open the Extraction Wizard:

⯈ Right-click into the **Schemes** listbox on the **Home** tab and select **Create new** from the context menu.

The Auto Book Dataviewer will open and show the available Database column headers. Each row represents one set of column headers.
If no row contains the headers you want, add a new row and type your own headers or modify the existing ones. Don't forget to click **Save** if you want to keep your changes for later use. When you are done, select the row of headers you want to use and click the **Select** button at the right end of the bottom ribbon. Note that the headers only have descriptive purpose. That is, Extraction Schemes with any headers are compatible with all Databases, no matter if their column headers are different. So your choice of headers isn't of critical importance, but it makes sense to give them names that make clear what you want to capture in each column. If for now you just want to try it out, just choose any set of headers. The Extraction Wizard will open in //Single Column View// mode. ===== - The Single Column View ===== The Single Column View of the Extraction Wizard shows the controls for configuring a single Extraction Scheme column. The text captured by applying the settings of the Extraction Scheme's first column will be inserted into the first Database column, and so on. The names of the Extraction Scheme and Database columns won't influence this, that is, if your first Extraction Scheme column is called "Date", while "Date" is the second column of the Database, the captured date data will still be added to the first Database column. The advantage of this approach is that all Extraction Schemes remain compatible with all Databases, independently of column names. The interface may look complex, but don't worry. It's not difficult. ==== - Overview ==== The screen is divided into a main part, where you define which text should be extracted, and a right-side panel which shows the results of your settings on-the-fly when applied to a test string. Be sure to paste a sample text into the **Test String** field and confirm that your settings really yield the result you expect. 
[{{ :autobook-v1.1-extractionwizard-singleview.png?nolink |Extraction Wizard showing controls for column "Date"}}] \\

=== - The Main Part ===

The current column number and its title are shown at the top center of the main part, in this example "1: Date". Use the **Back** and **Next** buttons at the lower end of the screen to switch between columns.

The controls in the main part are structured into 6 groups: [[#Data Source]], [[#Line Limitations]], [[#Sub-Line Limitations]], [[#Word Limitations]], [[#Symbol Limitations]] and [[#Fixed Text]]. Additionally, there are two checkboxes and the [[#Trim characters field]] in the top left area.

If you click any of the controls, the controls that are incompatible with the clicked control are automatically disabled (grayed out). If you **double-click** a selected control, the control will be cleared and any deactivated incompatible controls will be re-enabled. To clear all controls of the current column, click the **Reset** button.

If your settings are incomplete or inconsistent, the **View Errors** button becomes active. Click it to see information on how to fix your settings.

Always choose a [[#Data Source]] if you want to capture any received text; that is, only leave [[#Data Source]] empty if you want to leave the corresponding Database column blank or only enter [[#Fixed Text]].

After picking the [[#Data Source]], you can define in which line the text to be extracted is found via the [[#Line Limitations]] group. You can further refine your settings via [[#Sub-Line Limitations]], which allow you to pick only a certain part of a line, [[#Word Limitations]], which are used to pick only words fulfilling certain conditions, and [[#Symbol Limitations]], which limit the type of symbols (characters) that will be included in the extraction result.
None of these four groups are mandatory, that is, you can, for example, only use [[#Word Limitations]] without defining the line in which the text to be extracted is found, and so on. The basic principle is the same as when running a search using multiple search conditions. Each application of any type of limitations narrows down the extraction result that Auto Book will return. Apply as many limitations as are necessary to define the text you want to extract and store in a Database. Finally, regardless of your other settings, you always have the option to insert [[#Fixed Text]] (including dynamic components such as the current date/time) into the current Database column, which, if you have used any of the other controls to extract text, can be placed before or after the extracted text. See below for detailed explanations. Use the **View All** button to switch to the **All Columns View**, a condensed overview showing your settings for all 10 columns. Here, all the controls of the Single Column View are available as well (apart from the right-side panel), but for space reasons, some explanations are missing. You can return from the All Columns View to the Single Column View at any time (for details, see [[#All Columns View]]). When you're done with your settings for all columns – up to 10, but you don't need to use all – click the **Save** button to create your Extraction Scheme. You will be able to reopen your settings and make adjustments later via the **Load** button. To exit the Extraction Wizard, click, as you might have guessed, **Exit** (you will be prompted for unsaved changes). === - The Right-Side Panel === The right-side panel is used for testing the inputs you've made in [[#the main part]]. Paste a sample text -- e.g. an email containing the data you want to extract -- into the **Test String** field of the right-side panel. 
While you make selections or enter text, your input is constantly evaluated and the result shown in the **Test Result** field in the right-side panel, if no mandatory settings are missing. That is, the **Test Result** field shows which data would be extracted from your **Test String** when the settings you have made are applied. If you activate the **Process Email Source** checkbox, your test string is assumed to be an email source, and it will be pre-processed in the background to extract the email's text body, subject, sent date, sender name, and sender email address. In this case, the **Test Result** field will show the result of applying your settings to the content of the selected [[#Data Source]] instead of the whole test string. For example, if you select (3) Date as [[#Data Source]], your extraction settings will be applied to the email's sent date as indicated in the email header. To see the content of each [[#Data Source]], simply click each source in the listbox without making any other selections while the **Process Email Source** checkbox is activated. The **Regular Expression** and **RegEx Mode** fields will also be updated automatically whenever you make selections or enter text. They show the beginning of the RegEx patterns used internally to extract your data or the full patterns with Monster Regex License. If you don't know what RegEx is, don't worry about it. These fields are for informational purposes only -- for users who want to understand the inner workings of Auto Book or use the RegEx patterns for other purposes than Auto Book. It's fine to ignore them, as the Wizard will handle everything for you. === - The Email Interpreter === The Email Interpreter is opened via the button in the top right area of the Extraction Wizard's [[#the main part|main part]]. It allows you to see the result of decoding an email source in Auto Book. This functionality is intended for troubleshooting, and you probably will never need it. 
If you're just reading the manual to learn general usage, skip ahead to the next section for now. Auto Book can decode all common email encoding schemes such as Quoted-Printable, BASE64, etc., and also parses HTML. When you press Auto Book's [[manual#Hotkeys Tab|Email Source Extraction hotkey]], the selected text or clipboard's content is assumed to be an email source and decoded internally. Your Extraction Schemes are then applied to this decoded content instead of the raw text you are seeing on your screen. The decoding results can also be viewed by pasting the email source into the **Test String** field and clicking the **Process Email Source** checkbox in [[#the right-side panel]], and then selecting a [[#data source]] – the only difference is that the Email Interpreter will show the decoding result of the whole email, including both the header and the text body. After you click the **Email Interpreter** button, an empty window will open. Don't copy anything into this window. Rather, open the email source you want to decode in your email client and copy it into the clipboard by pressing CTRL+C. The decoded text will then, after about half a second, automatically pop up in the Email Interpreter window. The whole email should now be displayed as plain text – if not, the email's encoding is not supported by Auto Book or was not recognized. ==== - Extraction Settings ==== === - General Settings === == - Case-sensitive checkbox == If this option is activated, matching for text entered into any of the text input fields is case-sensitive. That is, if you enter PO into the **must contain** field of the **Word Limitations** group, for example, PO123 will be found, but not Po123. == - Use Regex for text fields checkbox == If this option is activated, strings entered into any of the text input fields of the Extraction Wizard (except the [symbols] field and the Fixed Text fields) are interpreted as Regular Expressions instead of plain text. 
If you don't know what Regular Expressions are, just ignore this section for now. If you do know how to use Regular Expressions, this option gives you additional flexibility in rare cases where the standard options below don't provide what you need. All of PCRE Regex is supported. Thus, for example, you can require that a word includes any single digit by typing \d into the **must include** field of the **Word Limitations** group. You can also use Lookaheads, Lookbehinds, anchors such as ^ or $, character classes, and so on.

There are plenty of tutorials online for learning Regular Expressions, if you are keen to do so. An extremely helpful site for testing your patterns is [[https://regex101.com/]].

Finally, here is a brief overview of the most useful expressions:

|.|any symbol except linefeeds|
|\d|any digit|
|[a-zA-Z]|any English letter|
|\pL|any letter from any language|
|?|makes the preceding symbol optional|
|*|zero or more of the preceding symbol, as many as possible|
|+|one or more of the preceding symbol, as many as possible|
|+?|one or more of the preceding symbol, as few as necessary|
|{2}|exactly two of the preceding symbol|
|{2,5}|between two and five of the preceding symbol|
|{2,}|two or more of the preceding symbol|
|{0,2}|at most two of the preceding symbol|
|%%|%%|or|
|%%^%%|beginning of a line|
|$|end of a line|
|(?=...)|Lookahead|
|(?!...)|Negative Lookahead|
|(?<%%=%%...)|Lookbehind|
|(?<%%!%%...)|Negative Lookbehind|

The following symbols must be escaped with a backslash (\) when used literally: \, ^, $, ., |, ?, *, +, (), [], {}

**IMPORTANT NOTE:** If you use any round brackets (groups) in your RegEx pattern, put ?: directly after the opening bracket to make the group non-capturing, unless the construct is already non-capturing by default (such as a Lookahead). Otherwise, you will break Auto Book's algorithm. Thus, to find A followed by X or Y, for example, use A(?:X|Y).
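The difference between capturing and non-capturing groups can be tried out with a short sketch. It uses Python's built-in ''re'' module, which is not PCRE and not what Auto Book runs internally, but it treats the constructs shown here (groups, greedy and lazy quantifiers, Lookaheads) in the same way. The helper names ''group_count'' and ''find_all'' are made up for this illustration.

```python
import re

# Two ways to write "A followed by X or Y".
NON_CAPTURING = r"A(?:X|Y)"   # the form this manual asks for
CAPTURING = r"A(X|Y)"         # adds a numbered group to the pattern


def group_count(pattern):
    """Return how many capturing groups a pattern defines."""
    return re.compile(pattern).groups


def find_all(pattern, text):
    """Return the full text of every match, ignoring any groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    print(group_count(NON_CAPTURING))            # 0
    print(group_count(CAPTURING))                # 1
    print(find_all(NON_CAPTURING, "AX AY AZ"))   # ['AX', 'AY']

    # Greedy "+" takes as many digits as possible, lazy "+?" as few as possible.
    print(re.search(r"\d+", "PO 12345").group(0))    # 12345
    print(re.search(r"\d+?", "PO 12345").group(0))   # 1

    # A Lookahead checks what follows without adding a capturing group.
    print(find_all(r"\d+(?= EUR)", "100 USD, 250 EUR"))  # ['250']
```

Both group forms match the same text; only the number of capturing groups differs, and that is the property the note above asks you to keep at zero.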
== - Trim characters field ==

Enter any characters into this field that you want to be removed from the beginning and end of the extraction result. For example, if your extraction settings sometimes return a "," at the end of the extracted word which you don't want, type , into this field. Spaces and tabs are always trimmed automatically, so you don't need to type them here.

=== - Data Source ===

[{{ :autobook-v1.1-extractionwizard-singleview-datasource.png?nolink|Data Source selection}}]

The choice of the data source is important only if you are going to use [[manual#Parameter Extraction Mode]] or [[manual#Email Source Extraction Mode]]. For [[manual#Normal Text Extraction]], your selection won't make a difference, because you are using only one piece of text for data extraction (the text within the clipboard); in this case, click either **(6) Clipboard/General** or any of the other options in case you want to make your Extraction Scheme compatible with other methods of data transmission.

Otherwise, pick the source from which you want to extract the data for this Database column:

(1) Date: The email date as indicated by your email client.\\
(2) Sender: The email sender name as indicated by your email client.\\
(3) Subject: The email subject as indicated by your email client.\\
(4) Address: The email address of the sender of the email as indicated by your email client.\\
(5) Text: The text body of the email as indicated by your email client.\\
(6) Clipboard/General: Select this option only if //not// using [[manual#Parameter Extraction Mode]] or [[manual#Email Source Extraction Mode]].\\

If you're going to process the email source, these sources – Date, Sender, Subject, Address and Text (body of the email) – are automatically extracted from the email source, and your Extraction Scheme settings will be applied to these resulting sources.
Thus, if you are going to extract a part of an email's subject line, for example, you don't need to worry about capturing the subject line from the email source, but only need to make the settings to define which part of this single line you need. As another example, if you select Date as your source for a certain column, you won't need to make any other settings if you're happy to capture the whole date as per email source into your Database – selecting a source without any other limiting settings means you are going to keep the whole source.

=== - Line Limitations ===

**Line limitations** allow you to define from which lines you want to extract text. If you have no need to do so – if, for example, you are able to capture the desired content merely by placing conditions on the words to be extracted under **Word Limitations**, or obviously, if your source text contains only a single line of text (such as an email's subject line) – simply leave them empty.

[{{ exwiz-singleview-linelimitations.png ?nolink |Line Limitations}}] \\

If you are only using **Line limitations** but no other limiting settings, the first matching line will be inserted into your Database as a whole.

== - Defining lines to capture by explicitly indicating their position ==

The left and middle block of the **Line Limitations** group (the controls below **The line to capture is the:** and **where the specific text consists of**) allow you to define a line or several lines in which the text to be extracted is found by explicitly indicating its/their position relative to either **the beginning of the source text** or a line **where a specific text** occurs. This is done by entering the line number or range of numbers into the **[N]** field. The initial line – the beginning of the source text or the line where the specific text occurs – is considered line 1, the next line is line 2, and so on.
To define a range of line numbers, enter, for example, 2-5 for lines from the 2nd to the 5th, <5 for lines up to the 4th, >1 for all following lines, and so on. Negative line numbers are currently not supported, but might be implemented in a later Auto Book version.

When indicating a range of lines, note that at most one line will be inserted into your Database. If you don't make any other limiting settings (e.g. via [[#Word Limitations]]), the first line of your range will be captured. If you do make other limiting settings, the first line with content that fulfills all other conditions as well will be captured. For example, if you stipulate that captured words must include the letters EUR, the first line of your range where such a word appears will be captured.

To count only lines that include some content (anything other than whitespace), click **N-th non-empty line from** in the top field. Otherwise, to count all lines, click **N-th line from**.

If you opt to define the line(s) relative to a specific text, the control under **where the specific text consists of** is automatically pre-set to **anything**, which means that you don't want to place any further restrictions on the specific text. If you do want to place further restrictions, click on the relevant option.

Examples:

  - If the text to be extracted is known to be in the very first line of the source text, pick **the beginning of the source text** in the second control field and enter 1 into the **[N]** field.
  - If the text to be extracted is found in the line below the word Remuneration, click **where a specific text occurs**, enter 2 into the **[N]** field – since the line where Remuneration appears is considered line 1 – and enter Remuneration into the **[Specific Text]** field.

By indicating a range of lines or counting only non-empty lines, you are able to implement some tolerance in case the data you receive is not always strictly following a certain layout.
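The counting rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not Auto Book's actual implementation: the function names ''parse_line_spec'' and ''first_line_in_range'' are hypothetical, and the sketch assumes that, when only non-empty lines are counted, the first non-empty line counts as line 1.

```python
import re


def parse_line_spec(spec):
    """Turn '3', '2-5', '<5' or '>1' into a test on 1-based line numbers."""
    spec = spec.strip()
    if m := re.fullmatch(r"(\d+)-(\d+)", spec):
        low, high = int(m.group(1)), int(m.group(2))
        return lambda n: low <= n <= high
    if m := re.fullmatch(r"<(\d+)", spec):
        limit = int(m.group(1))
        return lambda n: n < limit
    if m := re.fullmatch(r">(\d+)", spec):
        limit = int(m.group(1))
        return lambda n: n > limit
    if re.fullmatch(r"\d+", spec):
        target = int(spec)
        return lambda n: n == target
    raise ValueError(f"unsupported line spec: {spec!r}")


def first_line_in_range(text, spec, anchor_text=None, non_empty_only=False):
    """Return the first line whose number satisfies spec, or None.

    Counting starts at 1, either at the beginning of the text or at the
    first line containing anchor_text (assumed: plain substring match).
    """
    wanted = parse_line_spec(spec)
    lines = text.splitlines()
    start = 0
    if anchor_text is not None:
        for i, line in enumerate(lines):
            if anchor_text in line:
                start = i
                break
        else:
            return None  # the specific text never occurs
    number = 0
    for line in lines[start:]:
        if non_empty_only and not line.strip():
            continue  # empty lines are skipped and not counted
        number += 1
        if wanted(number):
            return line
    return None


if __name__ == "__main__":
    sample = "Invoice\n\nRemuneration\n\nEUR 500\nThanks"
    print(first_line_in_range(sample, "1"))
    print(first_line_in_range(sample, "2", anchor_text="Remuneration",
                              non_empty_only=True))
```

With the sample text, counting only non-empty lines lets the second call reach the EUR 500 line even though an empty line sits between it and Remuneration, which is the kind of layout tolerance described above.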
== - Other ways of limiting the line to capture ==

The controls under **The line to capture:** give you some additional options for defining which line in the source text should be captured. That is, you can require that the desired line includes a certain text string (**includes** field) or doesn't include a certain text string (**doesn't include**). Also, you can require that a certain text string appears in the following line (**is before a line including**).

Note that any or all of these options can be combined with the controls to the left for explicitly indicating the line position. That is, you could be looking for a line within the first 10 lines of the text including the word USD, for example.

=== - Sub-Line Limitations ===

Sub-line limitations allow you to capture a part of a line from after a user-defined string ("from" string) until before another user-defined string ("to" string).

[{{ autobook-v1.1-extractionwizard-singleview-sublinelimitations.png?nolink|Sub-Line Limitations}}]

  - Both the "from" string and the "to" string are optional. If the "from" string is not filled in, the line will be extracted from its beginning until the "to" string. Correspondingly, if the "to" string is missing, the line will be extracted from after the "from" string until the end of the line.
  - If a string occurs multiple times in a line, the occurrence that maximizes the length of the extracted text is used. That is, in case of the "from" string, the first occurrence is used, and in case of the "to" string, the last occurrence is used.

Examples will make clear that the idea is really very simple:

>> username@domain.com

⯈ To capture username, enter @ into the **to** field.\\
⯈ To capture domain, enter @ into the **from** field and . into the **to** field.\\
⯈ To capture com, enter . into the **from** field. (This approach obviously won't work if there are dots in the username as well.
In such a case, you could activate [[#Use Regex for text fields checkbox|Use Regex]] and enter @.*\. into the **from** field.)\\
⯈ To capture domain.com, enter @ into the **from** field.

>> PO: XY123456

⯈ To capture XY123456, enter : or PO: into the **from** field (a trailing space is optional). By entering PO:, you make sure that only lines including PO: will be captured, if you haven't defined the line via [[#Line Limitations]]. Otherwise, the first line including : will be used. (This example is identical to using the [[manual#standard_schemes|Standard Format]] in case PO is also the column title.)

== - Use Standard Format checkbox ==

To extract text based on the [[manual#standard_schemes|Standard Format]], activate the **Use Standard Format** checkbox. It's in this group of controls because it is, in effect, a kind of pre-defined sub-line limitation. This means that you are going to extract the line part following the column header and a colon, such as 2022-12-31 from the line Date: 2022-12-31 if Date is the column header. As this is a pre-defined complete column configuration, all other controls will be disabled, except [[#Fixed Text]], which you still can add before or after the extracted text.

=== - Word Limitations ===

Word limitations allow you to limit the extraction to words fulfilling certain conditions, e.g. that words begin or end with certain strings, include or don't include certain strings, or are preceded or followed by certain strings. The 6 fields with which to apply these conditions are hopefully self-explanatory.

💡 A "word" in Auto Book refers to a string of arbitrary symbols separated from other text by whitespace (spaces, tabs, line-breaks). That is, PO123456 and 123.45 are both considered single "words". (An option to set your own word-boundary conditions might be implemented in a later Auto Book version.)
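The whitespace-based definition of a "word" can be illustrated with a small Python sketch. It is not Auto Book's code: the function ''matching_words'' and its parameter names are hypothetical, it covers only four of the six conditions (not "preceded by" or "followed by"), and it assumes case-sensitive plain-text matching.

```python
def matching_words(text, begins_with="", ends_with="",
                   must_include="", must_not_include=""):
    """Return the whitespace-separated words that meet every given condition.

    An empty string means "no restriction" for that condition.
    """
    result = []
    for word in text.split():  # split() separates on spaces, tabs, line breaks
        if begins_with and not word.startswith(begins_with):
            continue
        if ends_with and not word.endswith(ends_with):
            continue
        if must_include and must_include not in word:
            continue
        if must_not_include and must_not_include in word:
            continue
        result.append(word)
    return result


if __name__ == "__main__":
    line = "Order PO123456 total 123.45 EUR"
    print(line.split())                          # five words
    print(matching_words(line, begins_with="PO"))
    print(matching_words(line, must_include="."))
```

Because only whitespace separates words, PO123456 and 123.45 each come out as a single word, dots and digits included.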
[{{ autobook-v1.1-extractionwizard-singleview-wordlimitations.png?nolink |Word Limitations}}] \\

These conditions can be freely combined with any of the other controls of the Extraction Wizard except [[#Sub-Line Limitations]]. That is, you can look for these words only in certain lines by using [[#Line Limitations]] or restrict the letters extracted from these words with [[#Symbol Limitations]].

In addition, all of these 6 fields accept the logical operators **AND** and **OR**. Thus, to find words that include both PO and 2022, enter PO AND 2022, and so on. Note that these operators must be written in CAPITAL LETTERS. (They can be escaped with a backslash: \AND is literal AND and \OR is literal OR.) You can use any number of ANDs or ORs in a text field, but, in Auto Book 1.1, not a mixture of ANDs and ORs.

If you know Regex, enabling the **Use Regex for text fields** checkbox opens up a broad range of additional possibilities. As a simple example, you could enter \b\d+\.\d\d\b into the **must include** field to capture pure numbers with exactly 2 decimals. More complex constructions are also possible – in principle, all of PCRE Regex is supported.

Furthermore, you can define the words to be extracted by explicitly indicating their position within a line with the two bottom right fields of the **Word Limitations** group. For example, to capture the first 3 words of a line:

⯈ Enter 1-3 into the field **Selected words (counting from beginning of line)**

Or to capture the 4th last and 2nd last word of a line:

⯈ Enter 2, 4 into the field **Selected words (counting from end of line)**

These two fields cannot be combined with each other – you have to count either from the beginning or from the end of the line. Furthermore, these two fields can //only// be combined with [[#Line Limitations]] and [[#Fixed Text]]. The other groups will be deactivated as soon as you type something in one of these fields.
Typically, you will use [[#Line Limitations]] to define the line to operate on and then use the **Selected words** fields to pick certain words. If you do not use [[#Line Limitations]], the first line that has at least as many words as are necessary to fulfill your selection will be used – for example, if you specify words 1-3, lines with only 2 words will be skipped and the first line with at least 3 words will be used.

=== - Symbol Limitations ===

== - Predefined Symbol Limitations ==

[{{ :autobook-v1.1-extractionwizard-singleview-symbollimitations.png?nolink|Symbol Limitations}}]

While the [[#Line Limitations]] group of controls allows you to limit the extracted text to a specific line and [[#Word Limitations]] set conditions on word level, **Symbol Limitations** allow you to further narrow down the extraction to certain symbols. Click on one of the four predefined sets of character classes -- **only letters**, **only digits**, **only digits, dot, comma**, or **only letters + digits** -- to allow only symbols of that type. These limitations work at character level, i.e. if you limit the expression "PO123456" to "only digits", only "123456" will be extracted.

== - Custom Symbol Limitations ==

Selecting the **Custom** option of the **Symbol Limitations** listbox allows you to define your own limitations in the **[Symbols]** field below. Enter any single character or any expression that would be permissible within square brackets in PCRE Regex, without the square brackets themselves. That is, entering a-zA-Z would allow all letters of the English alphabet, for example, and \-+0-9.,$€£ would allow plus and minus signs, digits, dots, commas, and $, €, and £ characters.

If you use the negation ^ as the very first character, you switch to exclusion mode and all following characters you enter will be //disallowed// instead of allowed. That is, ^0-9 would allow anything //except// digits.
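The character-level filtering described above can be imitated in Python as a rough sketch. The function name ''apply_symbol_limitation'' is made up for this illustration, and Python's ''re'' module stands in for PCRE here; it handles the simple classes shown but does not support PCRE-only items such as \pL.

```python
import re


def apply_symbol_limitation(text, symbols):
    """Keep only the characters of text that match the class [symbols].

    symbols is what you would type into the [Symbols] field, i.e. the
    content of a character class without the surrounding square brackets.
    A leading ^ negates the class, exactly as inside [...] in Regex.
    """
    char_class = re.compile("[" + symbols + "]")
    return "".join(char_class.findall(text))


if __name__ == "__main__":
    print(apply_symbol_limitation("PO123456", "0-9"))     # digits only
    print(apply_symbol_limitation("PO123456", "^0-9"))    # everything but digits
    print(apply_symbol_limitation("-12.50 €", r"\-+0-9.,$€£"))
```

In the last call the space is dropped because it is not part of the allowed set, while the escaped minus sign, the digits, the dot and the € sign are kept.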
Note that there are four symbols that must be escaped if you intend to include or exclude them because they are part of the Regex syntax. These four symbols are -, ^, ], and \. To escape these, put a \ in front. Thus, to allow -, for example, enter \-; to allow anything //except// \, enter ^\\; to allow ^, enter \^, and so on.

The following table shows some more examples of how to define character classes in Regex. Remember that in each case, the [] should not be entered into the **[Symbols]** field. For further information, use a search engine to search for Regex character classes.

|**One way to do it**|**Another Way**|**Characters allowed**|
|[\pL]| |Any kind of letter from any language|
|[\w]|[A-Za-z0-9_]|Word characters|
|[\W]|[%%^%%A-Za-z0-9_]|Non-word characters|
|[:alnum:]|[A-Za-z0-9]|Alphanumeric characters |
|[:alpha:]|[A-Za-z]|Alphabetic characters |
|[:digit:]|[0-9]|Digits |
| |[%%^%%0-9]|Non-digits|
|[:lower:]|[a-z]|Lowercase letters |
|[:punct:]|%%[!"#$%&'()*+,./:;<=>?@\^_`{|}~-]%%|Punctuation characters|
|[:upper:]|[A-Z]|Uppercase letters |

=== - Fixed Text ===

[{{ :autobook-v1.1-extractionwizard-singleview-fixedtext.png?nolink|Fixed Text fields}}]

The two fields in the **Fixed Text** group allow you to add fixed text, i.e. text independent of the source text, in front of or behind the extracted text. If you want to store ONLY fixed text for this Database column, simply leave all other fields empty. In this case, it doesn't matter whether you use the **Put this text in front** or **Put this text behind** field, as there is nothing to put it in front of or behind.

The **Fixed Text** fields also accept a few commands that make the text somewhat dynamic – it is still called "fixed" because it doesn't depend on the source text. Simply enter each command including the tags <> into either one of the **Fixed Text** fields.
When saving data to a Database, these commands will be automatically replaced as detailed below: ||Will be replaced with the folder path generated from the [[manual#Auto Folder]] pattern saved with this Extraction Scheme.| |