Email Address Extraction and Validation, Mailing List "Cleaning"
Email Address Extraction and Validation (Extract And Clean Email Addresses)
This tool allows you to extract email addresses from text files and validate their syntactic correctness.
An email address is considered syntactically valid if it meets the following requirements:
- Contains only Latin letters (a-z), digits (0-9), hyphen (-), underscore (_), period (.), and exactly one "@" symbol.
- Starts with a letter or digit.
- Does not exceed the maximum length of 45 characters (this value can be changed, see Additional Checks).
- Contains at least one period.
- Has at least one character before the period and at least one character after it.
- Ends with a Latin letter (a-z).
- Has a username (the part before the "@" symbol) of at least 2 characters.
- Has no hyphen in the domain (the part after the "@" symbol).
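As an illustration only, the rules above can be sketched in Python. This is not the tool's actual code. The function name `is_syntactically_valid` and the `max_length` parameter are hypothetical, uppercase letters are assumed to be accepted alongside a-z, and the "character before and after the period" rule is read as "some period is neither the first nor the last character".

```python
import string

LETTERS = set(string.ascii_letters)           # assumption: A-Z accepted as well as a-z
LETTERS_OR_DIGITS = LETTERS | set(string.digits)
ALLOWED = LETTERS_OR_DIGITS | set("-_.@")


def is_syntactically_valid(address: str, max_length: int = 45) -> bool:
    """Hypothetical check mirroring the listed syntax rules."""
    if not address or len(address) > max_length:
        return False
    # Only allowed characters, and exactly one "@".
    if not set(address) <= ALLOWED or address.count("@") != 1:
        return False
    # Starts with a letter or digit, ends with a letter.
    if address[0] not in LETTERS_OR_DIGITS or address[-1] not in LETTERS:
        return False
    # At least one period with a character on each side of it.
    if "." not in address[1:-1]:
        return False
    user, domain = address.split("@")
    # Username of at least 2 characters, no hyphen in the domain.
    if len(user) < 2 or "-" in domain:
        return False
    return True
```

Under these assumptions, `john.doe@example.com` passes, while `j@example.com` (username too short) and `john@my-site.com` (hyphen in the domain) are rejected.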
Additional (Optional) Checks
- Reject any addresses longer than the specified value (Reject any addresses longer than N).
- Allow embedded spaces in AOL usernames.
- Remove duplicate domains (No duplicate domains). In other words, the output file will contain no more than one email address per domain.
- Reject addresses whose domain contains too many periods: 3 or more for country domains, 2 or more for all other domains (Reject non-country domains with 2 or more dots and country domains with 3 or more dots). The list of country domains can be edited.
- Reject domains that start with numbers.
- Reject invalid top-level domains (extract email top-level domains).
- Reject email addresses containing only digits.
- Reject addresses that match a regular expression. For example, the following regular expression filters out all email addresses containing 3 or more repeating characters:
(.)\1{2}
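To see how two of these optional checks might behave, here is a minimal Python sketch. It assumes that Python's `re` syntax is close enough to the tool's regular expression flavor for this pattern, and that "No duplicate domains" keeps the first address seen for each domain. The function names `split_by_pattern` and `one_per_domain` are hypothetical.

```python
import re

# The pattern from the example above: any character repeated 3 or more times in a row.
REJECT_PATTERN = re.compile(r"(.)\1{2}")


def split_by_pattern(addresses, pattern=REJECT_PATTERN):
    """Split addresses into (kept, rejected) by searching for the pattern anywhere."""
    kept, rejected = [], []
    for address in addresses:
        (rejected if pattern.search(address) else kept).append(address)
    return kept, rejected


def one_per_domain(addresses):
    """Keep at most one address per domain (assumption: the first one wins)."""
    seen = set()
    result = []
    for address in addresses:
        domain = address.rsplit("@", 1)[-1].lower()
        if domain not in seen:
            seen.add(domain)
            result.append(address)
    return result
```

With this pattern, `aaa@example.com` and `info@xxx1.net` would be rejected, while `aab@example.com` would be kept.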
Pre-processing
- Convert OEM to ANSI. This setting changes the encoding of input files from OEM to ANSI before processing.
- Skip Characters. You can define a list of allowed characters in the input file; all other characters will be ignored. In some cases, this helps process binary or "corrupted" files containing invalid characters (for example, a binary zero). Example:
a-zA-Z0-9`!@#$%^&*()_+|\-=\\{}\[\]:";'<>?,./
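The two pre-processing steps can be approximated in Python as follows. This is a sketch under stated assumptions, not the tool's implementation: the OEM and ANSI code pages depend on the system locale, so `cp437` and `cp1252` are only placeholder defaults, and runs of disallowed characters are replaced with a single space (rather than deleted) so that neighbouring addresses do not merge. The function names are hypothetical.

```python
import re

# The allowed-character list from the example above, used as a regex character class.
ALLOWED_CLASS = r"""a-zA-Z0-9`!@#$%^&*()_+|\-=\\{}\[\]:";'<>?,./"""
DISALLOWED = re.compile("[^" + ALLOWED_CLASS + "]+")


def oem_to_ansi(raw: bytes, oem: str = "cp437", ansi: str = "cp1252") -> bytes:
    """Re-encode bytes from an OEM code page to an ANSI code page.

    Assumption: the code page names are placeholders; pick the ones matching your system.
    Characters with no ANSI equivalent become "?".
    """
    return raw.decode(oem).encode(ansi, errors="replace")


def skip_characters(text: str) -> str:
    """Replace every run of characters outside the allowed list with one space."""
    return DISALLOWED.sub(" ", text)
```

Note that the example list contains no whitespace, so in this sketch line breaks are also turned into spaces.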
Output Files
- Output File — a text file containing valid email addresses.
- Rejected File — a text file containing rejected (failed validation) email addresses.
Output File Sorting
You can enable sorting for the output file (Sort). Sorting options:
- Remove Duplicates.
- Sort By Domain.
- Remove domains that contain no more than a specified number N of email addresses. Removed email addresses can be saved to a file (Save removed emails to file).
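The combined effect of these sorting options might look like the following Python sketch. It assumes that duplicates are compared case-insensitively and that "no more than N" means a domain is removed when it has N or fewer addresses. The function name `sort_addresses` and its parameter `n` are hypothetical.

```python
from collections import Counter


def domain_of(address: str) -> str:
    """Return the lowercased part after the last "@"."""
    return address.rsplit("@", 1)[-1].lower()


def sort_addresses(addresses, n: int = 0):
    """Return (kept, removed) after deduplication, domain filtering and sorting.

    Assumption: addresses are lowercased before duplicates are removed.
    """
    unique = sorted({a.lower() for a in addresses})
    counts = Counter(domain_of(a) for a in unique)
    # Domains with n or fewer addresses are removed; the rest are kept.
    kept = [a for a in unique if counts[domain_of(a)] > n]
    removed = [a for a in unique if counts[domain_of(a)] <= n]
    # Sort By Domain: order by domain first, then by the full address.
    kept.sort(key=lambda a: (domain_of(a), a))
    return kept, removed
```

The `removed` list corresponds to what Save removed emails to file would write out.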
Additional Settings
- You can append a column containing the input filename to the output file (Append Filename column). Separator: tab character (TAB) or comma (COMMA).
Mailing List "Cleaning" (Clean Mail Lists)
Unlike the email extraction mode (Extract Emails), which works with unstructured text data, Clean Mail Lists is intended to normalize mailing lists into a common ("canonical") format.
To do this, enable Multi Column Support and define rules for reorganizing and formatting the data.
On the General tab:
- Replace column delimiters with a tab character (Replace delimiters by TAB).
- Replace column delimiters with a comma (Replace delimiters by COMMA).
- Remove quotes.
- Remove leading and trailing spaces from fields.
- Move email addresses to the first column.
- Remove empty fields. Example:
,;:
- Limit the number of output columns (Output columns).
- Define custom column delimiters (Custom delimiters).
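To make the General tab options concrete, here is a rough Python sketch of how a single line could be normalized. It is illustrative only: the function `normalize_row`, its parameters, and the default delimiter set `,;` are hypothetical, a field is treated as an email address simply if it contains "@", and the naive split does not honor delimiters inside quoted fields.

```python
import re


def normalize_row(line: str, delimiters: str = ",;", out_sep: str = "\t",
                  max_columns=None) -> str:
    """Normalize one mailing-list line (hypothetical helper)."""
    # Split on any of the custom delimiters (naive: ignores quoting).
    fields = re.split("[" + re.escape(delimiters) + "]", line)
    # Remove double quotes, then trim leading and trailing spaces.
    fields = [f.replace('"', "").strip() for f in fields]
    # Remove empty fields.
    fields = [f for f in fields if f]
    # Move email addresses to the first column(s).
    emails = [f for f in fields if "@" in f]
    others = [f for f in fields if "@" not in f]
    fields = emails + others
    # Limit the number of output columns.
    if max_columns is not None:
        fields = fields[:max_columns]
    # Replace delimiters by TAB (default) or another separator such as ",".
    return out_sep.join(fields)
```

For example, the line `"John Smith" ; ; john@example.com, "NY"` would come out as `john@example.com`, `John Smith`, `NY` separated by tabs.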
On the Format tab:
- Convert dates to the format defined in the system Regional Settings (Convert dates to system format). You need to specify the numbers of the columns containing dates as a comma-separated list, for example:
10,11
- Capitalize First Letters. You need to specify the column numbers for which this action should be applied.
- Convert column text to uppercase (Uppercase). You need to specify the relevant column numbers.
- Convert column text to lowercase (Lowercase). You need to specify the relevant column numbers.
On the Reorder/Remove Fields tab, you can choose which columns to output and in what order.
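The case conversions on the Format tab and the column selection on the Reorder/Remove Fields tab can be pictured as simple per-column transforms. The Python sketch below assumes that column numbers are 1-based and given as comma-separated lists. The function `format_row` and its parameters are hypothetical, and date conversion is left out because the input date format is not specified here.

```python
def parse_columns(spec: str):
    """Parse a comma-separated list of column numbers such as "10,11"."""
    return [int(part) for part in spec.split(",") if part.strip()]


def format_row(fields, capitalize: str = "", upper: str = "", lower: str = "",
               order: str = ""):
    """Apply case conversions to selected columns, then reorder or drop columns.

    Assumption: column numbers are 1-based; numbers outside the row are ignored.
    """
    out = list(fields)
    for spec, convert in ((capitalize, str.title), (upper, str.upper), (lower, str.lower)):
        for col in parse_columns(spec):
            if 1 <= col <= len(out):
                out[col - 1] = convert(out[col - 1])
    if order:
        # Output only the listed columns, in the listed order.
        out = [out[col - 1] for col in parse_columns(order) if 1 <= col <= len(out)]
    return out
```

Leaving `order` empty keeps every column in its original position.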