Indigo DQM Data Management System
Web Scraping and Web Crawling Help

The Indigo DQM Data Web Scraper and Web Crawler can be used to extract and process Data from HTML Web Pages. Web Scraping is used for content scraping and data extraction, and as a component of applications for web indexing, data mining, Email and Link extraction, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups, and web data integration. Web Crawling is the process of iteratively finding and fetching web links from a website.

Using the Web Page XML Data processed by Indigo DQM, the Indigo DRS Reporting Engine can create extremely powerful Reports for business intelligence on competitors' Websites.

Indigo DQM can harvest a single URL or an entire Website's content using the Web Crawler, then use XQuery to mine data from it, for example products, prices and contacts, delivering real business intelligence.

Select the Data Source Tab from the Indigo DQM Data Management Studio. Shared Sources can be used by all Data Command Queries and Execution Plans in the system.

Data Folders

Create a Data Folder for the Data Source to organise it in the Data Store.

Web Scraping HTML Websites / Webpages with the Web Scraper Single Data Provider

The Web Scraper Data Source is an HTML Web Page used for Data Mining / Extraction.

Enter a Name and Description for the Data Source and Select a Data Folder.

Selecting a Web Scraping Data Source

Select Web Scraper Single as the Data Source Type from the dropdown list and enter the Location for the Web Page Data Source. The Data Source is a Website or Webpage whose HTML is Web Scraped and converted to XML for Data Mining, Querying and Reporting.

To show the Data Connection String Properties, click the Properties button at the bottom of the Connection Dialog. This will display the Advanced Properties for the Connection String.

Enabling the Web Crawler

The Web crawler is an Internet bot which systematically and automatically browses the World Wide Web, typically for the purpose of Web indexing (web spidering) and Web Scraping.

Enable the Web Crawler and select the maximum number of Pages and the Depth you wish to Crawl.

The Crawl Action is set to 'Content' by default, which will download all the Web Page Content as HTML. The Crawl Action can also be set to 'Links', 'Emails' or 'MetaData', which will retrieve just the Web Page Links, Emails or MetaData respectively.

Click OK to Save the Data Source.

Web Scraping Multiple Websites / Webpages using the Web Scraper Multi Data Provider

The Multi Web Scraper can Web Scrape and Crawl multiple Websites and Webpages specified in a File List of URLs.

Create a text file of the URLs of the Websites / Webpages to Data Scrape and save it. The Multi Web Scraper Data Provider will read the URL from each line in the File and convert the HTML content to XML for Querying and Data Mining.
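
For example, a URL list File might look like the following, with one URL per line (the addresses here are placeholders, not part of the product documentation):

    https://www.example.com/
    https://www.example.com/products
    https://shop.example.org/catalogue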

The File Location can be either a local file or a remote file at a designated URL.

Click the Properties button to specify the Properties for the Multi Web Scraper Data Provider.

If Max URLs is set to zero then there is no limit to the number of URLs that are read from the File. The starting position in the File of the URLs to Crawl can also be specified; if zero, the File will be read from the beginning.

Enabling the Web Crawler

The Web Crawler options for the Multi Web Scraper are the same as for the single Web Scraper described above: enable the Web Crawler, select the maximum number of Pages and Depth to Crawl, and set the Crawl Action ('Content', 'Links', 'Emails' or 'MetaData') as required.

Crawler Configuration File

The Web Crawler can be configured using an XML File. Copy the Example below to a Directory and point the Web Crawler Config File in the Data Scraper to the new File. Change the Parameters in the File as required.

Example Web Crawler Configuration
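
A minimal sketch of what such a Configuration File might look like is shown below. The element names and values here are illustrative assumptions, not the documented Indigo DQM schema; match them to the Parameters in the example File supplied with your installation.

    <?xml version="1.0" encoding="utf-8"?>
    <!-- Illustrative sketch only: element names are assumptions, not the documented schema -->
    <WebCrawlerConfig>
      <MaxPages>100</MaxPages>                    <!-- maximum number of Pages to Crawl -->
      <CrawlDepth>2</CrawlDepth>                  <!-- how many Links deep to follow -->
      <CrawlAction>Content</CrawlAction>          <!-- Content | Links | Emails | MetaData -->
      <CrawlFilter>products,prices</CrawlFilter>  <!-- comma separated Filter terms -->
    </WebCrawlerConfig>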

Web Crawler Filter and Filter Files

The Web Crawler Filter can be used to include or exclude URLs in the Index.

Enter a Filter term and select the Filter action to either include or exclude any Web Pages in the Crawl containing the Filter term in the URL.

The Crawl Filter is a comma separated list of terms to Filter. The File Filter is the location of a File containing a list of Filter terms, with each Filter term on a new line.
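
For example, a Crawl Filter of products,prices,contact (the terms are placeholders) is equivalent to a Filter File containing:

    products
    prices
    contact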

The Link Filter similarly includes or excludes Links in the Link Extract for each Web Page Crawled.

Connect to the Web Page HTML Data Source

Connect to the HTML Web Page using the XQuery Designer to view it as XML Data for Scraping, Querying and Data Extraction.
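
The converted XML generally mirrors the structure of the source HTML. As an illustrative assumption (the exact output of the converter may differ), a scraped page might be exposed as:

    <html>
      <head>
        <title>Example Products Page</title>
        <meta name="keywords" content="products, prices, example"/>
      </head>
      <body>
        ...
      </body>
    </html>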

Extracting the Web Page Title using XQuery

Execute an XQuery statement against this Web Scrape to Extract the Web Page Title.
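
A minimal XQuery for this, assuming the page is exposed with standard HTML element names as in the sketch above (local-name() is used in case the converted XML carries a namespace):

    (: return the text of the title element :)
    //*[local-name() = 'title']/string()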

Execute an XQuery statement against this Web Scrape to Extract the Web Page Keywords.
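
Under the same assumptions, the Keywords can be read from the content attribute of the keywords meta tag:

    (: return the content of the keywords meta tag :)
    //*[local-name() = 'meta'][@name = 'keywords']/@content/string()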

Indigo DRS Report Designer for advanced Web Scraping and Reporting

The Indigo DRS Report Designer can be used to create advanced Reports and Outputs from Data Scraped from Web Pages.

Using the Web Page XML Data processed by Indigo DQM, the Indigo DRS Reporting Engine can create extremely powerful Reports for business intelligence on competitors' Websites, covering the same applications described at the start of this Help topic, from price comparison and product review scraping to website change detection and web data integration.