Microbot (Micro + robot)

A simple Java bot for fetching web resources, with many configuration options (anonymity, user agent, wait between requests, automatic compression, and so on). It is mainly useful when you want to grab information from a website. IT IS NOT A CRAWLER; it is just a bot. You feed the bot links (URLs), and it fetches and stores them.

Abilities

Microbot reads a configuration file that controls the following abilities:

  • Fetching multiple links: It handles cookies and the referer across a chain of requests.
  • User agents (UA): It selects a random UA from a list supplied by the user.
  • Using a proxy: mainly SOCKS, for networks such as TOR.
  • Resume: If the user terminates the program, it stores the remaining links in a file named Links.csv.
  • Auto compressor: Specify a size limit for the output folder; once the threshold is reached, the files are automatically compressed into an archive.
  • Logs: It stores logs so the user can handle errors (e.g. an HTTP response code of 4xx or 5xx).

Files and formats

The following is a short tutorial on the configuration. You can find sample user agents and a sample configuration file in the sample Files directory.

The configuration covers several topics:

  • Anonymizer
  • Web Requests
  • Load and Store
  • Logging

Each variable is described in the format: NAME: {AVAILABLE OPTIONS} ~ Description

Anonymizer

There are four variables in the anonymizer section:

  • Anonymizer: {TOR, I2P} ~ The type of anonymizer network, if one is used. This setting is relevant only when a proxy is configured along with it.

  • AnonymizerProxyType: {SOCKS, DIRECT, HTTP, NONE} ~ The type of proxy. To connect directly, without a proxy, use NONE. TOR mostly uses a SOCKS proxy.

  • AnonymizerIP: {ANY valid IPv4} ~ The IP address of the proxy. TOR uses localhost (127.0.0.1) in most cases.

  • AnonymizerPort: {ANY valid port} ~ The port number of the proxy. TOR uses 9050 or 9150 in most cases.

CAUTION: Microbot only uses the anonymizer network as a proxy. To make TOR or any other anonymizer network change the IP address used for requests, you need other tools or scripts (e.g. a script that sends a HUP signal to the TOR process every 100 seconds).
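
As an illustration of how the proxy variables above could map onto the standard Java networking API, here is a minimal sketch. The class name AnonymizerProxy and the method toProxy are hypothetical and are not taken from Microbot's source; treating DIRECT the same way as NONE is also an assumption.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Locale;

public class AnonymizerProxy {

    // Hypothetical helper: maps AnonymizerProxyType, AnonymizerIP and
    // AnonymizerPort onto a java.net.Proxy instance.
    public static Proxy toProxy(String type, String ip, int port) {
        switch (type.trim().toUpperCase(Locale.ROOT)) {
            case "SOCKS":
                return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(ip, port));
            case "HTTP":
                return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(ip, port));
            case "DIRECT": // assumption: handled like NONE
            case "NONE":
                return Proxy.NO_PROXY;
            default:
                throw new IllegalArgumentException("Unknown proxy type: " + type);
        }
    }

    public static void main(String[] args) {
        // Typical TOR setup according to the descriptions above.
        Proxy tor = toProxy("SOCKS", "127.0.0.1", 9050);
        System.out.println(tor.type() + " " + tor.address());
    }
}
```

A proxy built this way can be passed to URL.openConnection(Proxy), so the request is routed through the anonymizer network.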

Web Requests

The variables in web requests are as follows:

  • AcceptSelfSignedCertificates: {true, false} ~ If true, Microbot accepts invalid or self-signed certificates (HTTPS).

  • ThreadCount: {ANY positive integer -> 1, 2, 3, ...} ~ The number of threads used to speed up fetching.

  • MinSleep: {ANY positive integer -> 1, 2, 3, ...} ~ The minimum amount of time (in seconds) that each thread waits after requesting a new resource.

  • MaxSleep: {ANY positive integer larger than MinSleep -> 1, 2, 3, ...} ~ The maximum amount of time (in seconds) that each thread waits after requesting a new resource.

  • UserAgentListFile: {Valid text file path} ~ The file has a simple format: each user agent is on its own line. For compatibility, lines should end with Windows-style line breaks (\r\n). A file named UA.txt with a list of UAs is provided; you may use your own UA list instead.

  • RandomUserAgent: {true, false} ~ If true, Microbot uses a random user agent from UserAgentListFile; if false, it uses the first UA (line) in UserAgentListFile.

  • Cookie: {Any valid String in cookie format} ~ Fill this parameter if you want to use a fixed cookie for all requests. In most cases this field is not needed and can be left blank.
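
To make the interplay of UserAgentListFile, RandomUserAgent, MinSleep and MaxSleep more concrete, here is a minimal sketch. The class RequestPacing and its methods are hypothetical and not part of Microbot; in particular, whether Microbot treats the sleep bounds as inclusive is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RequestPacing {

    // Splits the content of a UA list file into one user agent per line.
    // The file is expected to use \r\n line breaks, as described above.
    public static List<String> parseUserAgents(String content) {
        List<String> agents = new ArrayList<>();
        for (String line : content.split("\r\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                agents.add(trimmed);
            }
        }
        return agents;
    }

    // RandomUserAgent = true: any UA from the list; false: the first line.
    public static String pickUserAgent(List<String> agents, boolean randomUserAgent, Random rng) {
        if (agents.isEmpty()) {
            throw new IllegalArgumentException("User agent list is empty");
        }
        return randomUserAgent ? agents.get(rng.nextInt(agents.size())) : agents.get(0);
    }

    // Picks a wait time in seconds between minSleep and maxSleep.
    // Assumption: both bounds are inclusive.
    public static int sleepSeconds(int minSleep, int maxSleep, Random rng) {
        if (minSleep < 1 || maxSleep < minSleep) {
            throw new IllegalArgumentException("Need 1 <= MinSleep <= MaxSleep");
        }
        return minSleep + rng.nextInt(maxSleep - minSleep + 1);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        List<String> agents = parseUserAgents("ExampleAgent/1.0\r\nExampleAgent/2.0\r\n");
        System.out.println("UA: " + pickUserAgent(agents, true, rng));
        System.out.println("Wait: " + sleepSeconds(2, 5, rng) + " s");
    }
}
```

Each worker thread would call sleepSeconds after a request and pass the result (times 1000) to Thread.sleep, so requests are spaced irregularly.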

Load and Store

The variables involved in loading and storing are as follows:

  • InputFile: {Valid text file path in CSV format} ~ The input file, which holds two pieces of information per line: the name of the output file and the URL. A sample follows:

NAME,URL
page1.html,http://foo.bar/page1.php
page2.html,http://foo.bar/page2.php~~~http://foo.bar/page2.php?id=Hello~~~http://foo.bar/page2.php?id=World
page3.html,http://foo.bar/page3.php
page4.html,http://foo.bar/page4.php

Pages 1, 3, and 4 are simple examples. Page 2 is a referer-based example, in which the links are separated by ~~~ or any other valid user-defined separator.
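
The sample lines above can be split into an output name and a chain of links roughly as follows. This is only a sketch: the class InputLineParser is hypothetical, it assumes the NAME column comes first and contains no comma, and it assumes the links of a chain are listed in the order they should be requested.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class InputLineParser {

    // Returns the position of the comma that separates NAME from URL.
    // Assumption: the first comma is the column separator.
    private static int columnBreak(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Expected NAME,URL but got: " + line);
        }
        return comma;
    }

    // The output file name, e.g. "page2.html".
    public static String outputName(String line) {
        return line.substring(0, columnBreak(line)).trim();
    }

    // The links of one line, split on the user-defined separator.
    // Pattern.quote keeps separators such as "|" from being read as regex.
    public static List<String> urlChain(String line, String separator) {
        String urls = line.substring(columnBreak(line) + 1).trim();
        return Arrays.asList(urls.split(Pattern.quote(separator)));
    }

    public static void main(String[] args) {
        String line = "page2.html,http://foo.bar/page2.php~~~"
                + "http://foo.bar/page2.php?id=Hello~~~http://foo.bar/page2.php?id=World";
        System.out.println(outputName(line));
        for (String url : urlChain(line, "~~~")) {
            System.out.println("  " + url);
        }
    }
}
```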

  • MainURLColumnName: {URL column name in InputFile} ~ The above sample has two columns: NAME and URL. URL is the MainURLColumnName.

  • OutputFileColumnName: {Output file name column in InputFile} ~ The above sample has two columns: NAME and URL. NAME is the OutputFileColumnName.

  • LinksSeparator: {ANY valid character sequence} ~ The separator for referer-based requests. The above sample (second line) uses ~~~.

  • OutputDirectory: {ANY valid folder path for which the user has write permission} ~ All HTML output is stored here under the name given in OutputFileColumnName.

  • OutputLimit: {B, KB, MB, GB, TB} ~ The threshold at which Microbot starts to compress retrieved documents (e.g. 2MB or 300MB).

  • DeleteAfterCompress: {true, false} ~ Indicates whether Microbot should delete the retrieved files after successfully compressing them.

  • CompressorType: {ZIP, TAR, RAR, GZIP} ~ The type of output archive. Currently only the ZIP format is implemented; the other options cause an exception.
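
An OutputLimit value such as 2MB or 300MB has to be turned into a byte count before it can be compared with the size of the output directory. A minimal sketch of that conversion follows. The class OutputLimitParser is hypothetical; using 1024 as the unit multiplier and triggering compression once the size reaches the limit (rather than exceeds it) are both assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutputLimitParser {

    private static final List<String> UNITS = Arrays.asList("B", "KB", "MB", "GB", "TB");
    private static final Pattern LIMIT = Pattern.compile("^(\\d+)\\s*(B|KB|MB|GB|TB)$");

    // Converts a limit such as "300MB" into bytes.
    // Assumption: each unit step is a factor of 1024.
    public static long toBytes(String limit) {
        Matcher m = LIMIT.matcher(limit.trim().toUpperCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid OutputLimit: " + limit);
        }
        long bytes = Long.parseLong(m.group(1));
        int steps = UNITS.indexOf(m.group(2));
        for (int i = 0; i < steps; i++) {
            bytes = Math.multiplyExact(bytes, 1024L); // throws on overflow
        }
        return bytes;
    }

    // Assumption: compression starts once the directory size reaches the limit.
    public static boolean shouldCompress(long directoryBytes, long limitBytes) {
        return directoryBytes >= limitBytes;
    }

    public static void main(String[] args) {
        long limit = toBytes("2MB");
        System.out.println(limit + " bytes");
        System.out.println(shouldCompress(3L * 1024 * 1024, limit));
    }
}
```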

Logging

The logging parameters are very simple.

  • Debug: {true, false} ~ Prints debug information to standard output.

  • VeryVerbos: {true, false} ~ Prints debug information in very verbose mode. Debug must be enabled for this to take effect.