RegExp Extractor

1.12 released 2016-12-27
for 32-bit Windows

RegExp Extractor is a utility that extracts data from text files and logs using regular expressions, which you combine into named conditions and rules.

It is very fast and can process huge files.

To use this program you need to know regular expressions.

Free for private, non-commercial use.

If you like RegExp Extractor, please consider donating a small amount to help me keep developing and updating it.

Bitcoin address: 1DfVTu6DCrKzPN6Yqzr3vWJDe7YH3yj22W

To learn more about Bitcoins, visit the website (https://bitcoin.org) or read more on Wikipedia.


Source file(s)

The file(s) that you want to extract data from. You can use a wildcard mask here, e.g. c:\temp\*.txt

Output file

Output file for the extracted data.

Output dir

RegExp Extractor can produce several output files. This option sets the destination folder for them.

Save other lines to file

Lines that don't match any regular expression are saved to this file.
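
To show how the source mask and the "other lines" file relate, here is a rough Python sketch. It is not how RegExp Extractor is implemented; the function name split_lines and the single pattern argument are assumptions made for illustration only.

```python
import glob
import re

def split_lines(mask, pattern):
    """Read every file matching `mask`; return (extracted, other_lines).

    `extracted` holds each matching substring, `other_lines` holds the
    whole lines in which the pattern was not found at all.
    """
    regex = re.compile(pattern)
    extracted, other_lines = [], []
    for path in sorted(glob.glob(mask)):  # expand a mask such as c:\temp\*.txt
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                line = line.rstrip("\n")
                hits = [m.group(0) for m in regex.finditer(line)]
                if hits:
                    extracted.extend(hits)
                else:
                    other_lines.append(line)
    return extracted, other_lines
```

The first list corresponds to what would go to the output file, the second to what would go to the "other lines" file.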

Conditions/Rules Tabs

Each tab contains a set of conditions and rules for extracting data.
For example, the "emails" tab contains conditions and rules to extract emails, and the "url-domains" tab contains conditions and rules to extract domains from URLs.

To add a new tab, use the [+] button below the tabs. To remove an existing tab, use the [-] button.

When you press the "Start" button, RegExp Extractor extracts data from the source file(s) using the conditions and rules from the active tab.

Each set of conditions and rules has a title, which is shown as the name of the tab.

Check the "Sort / Dedup output files" option to sort the output files and remove duplicate lines.

Extract Conditions

Each line contains a regular expression with a name, written as name=/regular expression/. Conditions are referenced by name in Extract Rules.

Extract Rules

Each line contains a rule that specifies what data to extract.

Example 1.

Condition: email=/[a-z0-9][a-z0-9.-]+[a-z0-9]@[a-z0-9][a-z0-9.-]+[a-z0-9]/
Rule: email:$0

In this example, email is the name of the condition.
/[a-z0-9][a-z0-9.-]+[a-z0-9]@[a-z0-9][a-z0-9.-]+[a-z0-9]/ is the regular expression that matches emails.
$0 means that every substring of the source line that matches the condition is extracted. In this example, that is each email address.
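
To try the pattern outside the program, a rough Python equivalent of Example 1 might look like the sketch below. The function name extract_all is made up for illustration, and the sketch assumes that $0 yields every non-overlapping match in a line and that matching is case-sensitive because the condition has no i flag.

```python
import re

# The regular expression from Example 1 (no flags, so lower case only).
EMAIL = re.compile(r"[a-z0-9][a-z0-9.-]+[a-z0-9]@[a-z0-9][a-z0-9.-]+[a-z0-9]")

def extract_all(line):
    """Return every substring of `line` that matches EMAIL ($0 of each match)."""
    return [m.group(0) for m in EMAIL.finditer(line)]

if __name__ == "__main__":
    for hit in extract_all("from email1@domain1.com to email2@domain2.com"):
        print(hit)
```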

Example 2.

Condition: url-domain=/https?://([a-z0-9][a-z0-9.-]+[a-z0-9])|(www\.[a-z0-9.-]+[a-z0-9])/i
Rule: url-domain:$1$2

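In this example, url-domain is the name of the condition.
/https?://([a-z0-9][a-z0-9.-]+[a-z0-9])|(www\.[a-z0-9.-]+[a-z0-9])/i is the regular expression that matches URLs.
$1$2 extracts the first group ([a-z0-9][a-z0-9.-]+[a-z0-9]) and the second group (www\.[a-z0-9.-]+[a-z0-9]) from each substring that matches the regular expression. The two groups sit on opposite sides of the | alternation, so only one of them takes part in any given match, and $1$2 yields whichever one matched.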

You can also use other characters in rules to build the result lines, e.g.: email:The email is $0
The result lines will look like this:

The email is email1@domain1.com
The email is email2@domain2.com
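
The substitution of $0, $1 and $2 in a rule can be pictured with the following Python sketch. It is only an illustration, not the program's code: expand and apply_rule are hypothetical names, and the sketch assumes that a group which did not take part in a match is replaced by an empty string.

```python
import re

EMAIL = re.compile(r"[a-z0-9][a-z0-9.-]+[a-z0-9]@[a-z0-9][a-z0-9.-]+[a-z0-9]")
URL_DOMAIN = re.compile(
    r"https?://([a-z0-9][a-z0-9.-]+[a-z0-9])|(www\.[a-z0-9.-]+[a-z0-9])",
    re.IGNORECASE,  # the /i flag of the condition
)

def expand(template, match):
    """Replace $0..$9 in `template` with the corresponding groups of `match`."""
    def group_text(ref):
        # Assumption: a group that did not participate becomes "".
        return match.group(int(ref.group(1))) or ""
    return re.sub(r"\$([0-9])", group_text, template)

def apply_rule(regex, template, line):
    """Build one result line per match of `regex` in `line`."""
    return [expand(template, m) for m in regex.finditer(line)]

if __name__ == "__main__":
    print(apply_rule(EMAIL, "The email is $0", "email1@domain1.com"))
    print(apply_rule(URL_DOMAIN, "$1$2", "see http://example.com/page"))
```

Note that with this pattern a bare www address keeps its www. prefix in the result, because the prefix is inside the second group.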

Separate by conditions

This option saves lines that match different conditions into different files in the output folder.
See Example 3 below.

Example 3.

Separate by conditions: On

Conditions:

sent-ok=/sent ok/i
blocked=/blocked/i
http=/(https?://)|(www\.)[a-z0-9.-]+//
err=/(-ERR \[[0-9]{3}\] : ).+ : (.+)/>

Rules:

sent-ok!:$L
blocked!^err:$L
http!^err:$L
err!:$L

This example demonstrates how to save:
lines that contain the "sent ok" substring to sent-ok.txt,
lines that contain the "blocked" substring AND don't match the err condition to blocked.txt,
lines that contain URLs (that match the http condition) AND don't match the err condition to http.txt,
lines that match the err condition to err.txt.

The ! sign after the condition name in a rule means that RegExp Extractor stops processing further rules for a line once the line matches the condition of this rule. If we omitted ! in this example, RegExp Extractor would save the line "sent ok: blocked" to both files, sent-ok.txt and blocked.txt.

The ^ sign in blocked!^err means that the line must match the condition blocked and must NOT match the condition err.

You can also use the ~ sign, which means that the line SHOULD NOT match the condition that follows it.
Example: blocked!~sent-ok:$L. In this example, $L means that the WHOLE line is extracted, not only the substring that matches the regular expression.
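
The routing in Example 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's actual logic: it follows the description of the example, where blocked!^err selects lines that match blocked and do not match err, it assumes that ! only stops processing when the whole rule applied, and it simplifies the http and err patterns rather than copying them exactly. The names CONDITIONS, RULES and route are hypothetical.

```python
import re

# Conditions from Example 3, transcribed by hand into Python syntax.
# The http and err patterns are simplified; they are not the exact originals.
CONDITIONS = {
    "sent-ok": re.compile(r"sent ok", re.IGNORECASE),
    "blocked": re.compile(r"blocked", re.IGNORECASE),
    "http": re.compile(r"https?://|www\."),
    "err": re.compile(r"-ERR \[[0-9]{3}\] : .+ : .+"),
}

# Each rule: (condition name, stop flag for "!", condition that must not match).
RULES = [
    ("sent-ok", True, None),
    ("blocked", True, "err"),
    ("http", True, "err"),
    ("err", True, None),
]

def route(line, rules=RULES):
    """Return the output file names that the whole line ($L) would go to."""
    targets = []
    for name, stop, excluded in rules:
        if not CONDITIONS[name].search(line):
            continue
        if excluded and CONDITIONS[excluded].search(line):
            continue  # rule does not apply, keep trying later rules
        targets.append(name + ".txt")
        if stop:  # "!" present: no further rules for this line
            break
    return targets

if __name__ == "__main__":
    print(route("sent ok: blocked"))
    print(route("-ERR [550] : blocked : spam"))
```

A line for which route returns an empty list matches no rule; such lines are candidates for the "Save other lines to file" output.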

Example 4.

Separate by conditions: Off
Condition: http=/https?://([a-z0-9][a-z0-9.-]+[a-z0-9])|(www\.[a-z0-9.-]+[a-z0-9])/i
Rule: http>>$1$2.txt:$L

In this rule, the part after >> is the output file name, $1$2.txt, where $1$2 is the domain of the extracted URL. $L writes the whole source line to that file.
This example demonstrates how to separate lines by domain.
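
As a rough picture of what Example 4 produces, the Python sketch below groups whole lines under a file name built from $1$2. It is a simplification for illustration: group_by_domain is a hypothetical name, only the first URL in each line is considered, and nothing is written to disk.

```python
import re
from collections import defaultdict

# The http condition from Example 4, with /i mapped to re.IGNORECASE.
HTTP = re.compile(
    r"https?://([a-z0-9][a-z0-9.-]+[a-z0-9])|(www\.[a-z0-9.-]+[a-z0-9])",
    re.IGNORECASE,
)

def group_by_domain(lines):
    """Map each output file name ($1$2.txt) to the whole lines ($L) routed to it."""
    files = defaultdict(list)
    for line in lines:
        match = HTTP.search(line)  # assumption: first URL decides the file
        if match is None:
            continue
        domain = (match.group(1) or "") + (match.group(2) or "")
        files[domain + ".txt"].append(line)
    return dict(files)

if __name__ == "__main__":
    sample = ["GET http://a.example.com/x", "ref www.b.example.org"]
    for name, grouped in group_by_domain(sample).items():
        print(name, grouped)
```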