Web Scraping and Web Crawling

Indigo DQM's Data Management Engine (DME) features a powerful Web Scraper and Web Crawler that can be used to extract and process Data from HTML pages.

Web Scraping is used for content scraping and Data Extraction, and as a component of applications used for Web Indexing, Data Mining, E-mail and Link extraction. Web Crawling is the process of iteratively finding and fetching Data from a Website.

Using the Web Page XML Data processed by Indigo DQM, the Indigo DRS Reporting Engine can create Reports for business intelligence on competitors' Websites.

Indigo DQM can harvest a single URL or an entire Website's content using the Web Crawler. XQuery / XPath can be used for Data Mining / Scraping to extract information such as products, prices and contacts from a Webpage.
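As a rough illustration of the kind of XPath query described above, the sketch below runs path expressions over a small XML fragment standing in for a converted Web Page. It uses Python's standard library rather than Indigo DQM itself, and the element names, class attributes and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, standing in for an HTML product page converted to XML.
PAGE_XML = """
<html>
  <body>
    <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
    <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
  </body>
</html>
"""

def extract_products(xml_text):
    """Return (name, price) pairs for every div with class="product"."""
    root = ET.fromstring(xml_text)
    products = []
    # ElementTree supports a limited XPath subset, including attribute predicates.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("span[@class='name']")
        price = float(node.findtext("span[@class='price']"))
        products.append((name, price))
    return products

print(extract_products(PAGE_XML))
```

In a real Web Page the path expressions would need to match whatever markup the target site uses.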

Data Mining / Scraping can also be used for online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather monitoring, website change detection, research, tracking online presence and reputation, web mashups, and web integration.

Select the Data Sources tab from the Indigo DQM Data Management Studio. Shared Sources can be used by all Data Command Queries and Execution Plans in the system.

Data Folders

Create a Data Folder for the Data Source to organise it in the Data Store.

Web Scraping HTML Websites / Webpages with the Web Scraper Single Data Provider

The Web Scraper Data Source is an HTML Web Page used for Data Mining / Extraction.

Enter the details for the Data Source and select a Data Folder.

Selecting a Web Scraping Data Source

Select Web Scraper Single as the type from the drop-down list and enter the location for the Web Page Data Source.

The Data Source is a Website or Webpage whose HTML is Scraped / Crawled and converted to XML for Data Mining, Data Queries, Data Reporting etc.

Click the 'Properties' button to configure the advanced settings and options for the Web Scraper / Crawler Data Source.

Click OK to update the Data Source.

Enabling the Web Crawler

The Web Crawler is an Internet bot which systematically and automatically browses the World Wide Web, typically for the purpose of Web indexing (web spidering) and Web Scraping.

Enable the Web Crawler and select the maximum number of pages and depth you wish to crawl.

The Crawl Action is set to 'Content' by default which will download all the Web Page Content as HTML. The Crawl Action can also be set to 'Links', 'Emails' and 'MetaData' which will just retrieve the Web Page Links, Emails and MetaData accordingly.
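The difference between the four Crawl Actions can be sketched in a few lines of Python. This is only an approximation of the behaviour described above, not Indigo DQM's implementation: the function name `crawl_action`, the email pattern and the sample page are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Simplified email pattern, good enough for a demonstration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class PageExtractor(HTMLParser):
    """Collects link targets and named meta tags while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name"):
            self.metadata[attrs["name"]] = attrs.get("content", "")

def crawl_action(html, action="Content"):
    """Return the part of the page selected by the given action."""
    if action == "Content":
        return html  # the whole page, unchanged
    if action == "Emails":
        return sorted(set(EMAIL_RE.findall(html)))
    parser = PageExtractor()
    parser.feed(html)
    if action == "Links":
        return parser.links
    if action == "MetaData":
        return parser.metadata
    raise ValueError(f"unknown crawl action: {action}")

SAMPLE = """<html><head><meta name="keywords" content="data, quality"></head>
<body><a href="/about">About</a> <a href="mailto:info@example.com">Mail</a></body></html>"""

for name in ("Links", "Emails", "MetaData"):
    print(name, crawl_action(SAMPLE, name))
```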

Click OK to save the Data Source.

Web Scraping Multiple Websites / Webpages using the Web Scraper Multi Data Provider

Multi Web Scraping can Web Scrape and Web Crawl multiple Websites and Webpages specified in a list of URLs saved to a File.

Text Files containing the URLs of the Websites / Webpages to Data Scrape can be created using a Text Editor and saved for loading into the Multi Web Scraper.

The Data Scraper Multi Data Provider reads the URL on each line of the File and converts the HTML content to XML for Data Mining using XQuery / XPath.

http://www.indigodqm.com
http://www.indigodqm.co.uk
http://www.ajesoftware.co.uk
http://www.ajeconsulting.co.uk

The File URI can point to either a local File or a remote File at a designated URL.

Click the 'Properties' button to configure the advanced settings and options for the Multi Web Scraper Data Provider.

If Max URLs is set to zero, there is no limit to the number of URLs read from the File. The starting position in the File of the URLs to crawl can also be specified. If it is zero, the File is read from the beginning.
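The Max URLs and starting position settings can be pictured with the short Python sketch below. It is an illustration only: the function name `read_url_list` is hypothetical, and treating the starting position as a zero-based count of URLs to skip is an assumption about how the setting behaves.

```python
def read_url_list(lines, max_urls=0, start=0):
    """Return the URLs found in an iterable of text lines.

    max_urls=0 means no limit; start=0 means begin with the first URL.
    Blank lines are ignored.
    """
    urls = [line.strip() for line in lines if line.strip()]
    selected = urls[start:]
    return selected if max_urls == 0 else selected[:max_urls]

# The lines of a URL list File, as they would be read from disk.
URL_FILE_LINES = [
    "http://www.indigodqm.com\n",
    "http://www.indigodqm.co.uk\n",
    "http://www.ajesoftware.co.uk\n",
    "http://www.ajeconsulting.co.uk\n",
]

print(read_url_list(URL_FILE_LINES))                       # all four URLs
print(read_url_list(URL_FILE_LINES, max_urls=2, start=1))  # second and third URLs
```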

Enabling the Web Crawler

The Web Crawler is an Internet bot which systematically and automatically browses the World Wide Web, typically for the purpose of Web indexing (web spidering) and Web Scraping.

Enable the Web Crawler and select the maximum number of pages and depth you wish to crawl.
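To show how a page limit and a depth limit interact, here is a minimal breadth-first crawl over an in-memory link graph. It is a sketch under stated assumptions, not the crawler used by Indigo DQM: the `SITE` dictionary, the `crawl` function and the definition of depth as the number of link hops from the start page are all illustrative.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/"],
    "/b": ["/b1"],
    "/a1": ["/deep"],
    "/b1": [],
    "/deep": [],
}

def crawl(start, get_links, max_pages, max_depth):
    """Crawl breadth-first, stopping at max_pages pages or max_depth link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    crawled = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth < max_depth:  # only follow links while below the depth limit
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return crawled

def site_links(url):
    return SITE.get(url, [])

print(crawl("/", site_links, max_pages=10, max_depth=1))
print(crawl("/", site_links, max_pages=4, max_depth=5))
```

With a depth of 1 only the start page and the pages it links to are fetched; with a page limit of 4 the crawl stops as soon as four pages have been fetched, whatever the depth.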

Crawler Configuration File

The Web Crawler can be configured using an XML File. Copy the example below to a directory and point the Web Crawler Config File in the Data Scraper to the new File. Change the parameters in the Config File as required.

Example Web Crawler Configuration

<CrawlConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <MaxConcurrentThreads>10</MaxConcurrentThreads>
  <MaxPagesToCrawl>6</MaxPagesToCrawl>
  <MaxPagesToCrawlPerDomain>0</MaxPagesToCrawlPerDomain>
  <MaxPageSizeInBytes>0</MaxPageSizeInBytes>
  <UserAgentString>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</UserAgentString>
  <CrawlTimeoutSeconds>0</CrawlTimeoutSeconds>
  <IsUriRecrawlingEnabled>false</IsUriRecrawlingEnabled>
  <IsExternalPageCrawlingEnabled>false</IsExternalPageCrawlingEnabled>
  <IsExternalPageLinksCrawlingEnabled>false</IsExternalPageLinksCrawlingEnabled>
  <IsRespectUrlNamedAnchorOrHashbangEnabled>false</IsRespectUrlNamedAnchorOrHashbangEnabled>
  <DownloadableContentTypes>text/html</DownloadableContentTypes>
  <HttpServicePointConnectionLimit>200</HttpServicePointConnectionLimit>
  <HttpRequestTimeoutInSeconds>15</HttpRequestTimeoutInSeconds>
  <HttpRequestMaxAutoRedirects>7</HttpRequestMaxAutoRedirects>
  <IsHttpRequestAutoRedirectsEnabled>true</IsHttpRequestAutoRedirectsEnabled>
  <IsHttpRequestAutomaticDecompressionEnabled>false</IsHttpRequestAutomaticDecompressionEnabled>
  <IsSendingCookiesEnabled>false</IsSendingCookiesEnabled>
  <IsSslCertificateValidationEnabled>true</IsSslCertificateValidationEnabled>
  <MinAvailableMemoryRequiredInMb>0</MinAvailableMemoryRequiredInMb>
  <MaxMemoryUsageInMb>0</MaxMemoryUsageInMb>
  <MaxMemoryUsageCacheTimeInSeconds>0</MaxMemoryUsageCacheTimeInSeconds>
  <MaxCrawlDepth>100</MaxCrawlDepth>
  <MaxLinksPerPage>0</MaxLinksPerPage>
  <IsForcedLinkParsingEnabled>false</IsForcedLinkParsingEnabled>
  <MaxRetryCount>0</MaxRetryCount>
  <MinRetryDelayInMilliseconds>0</MinRetryDelayInMilliseconds>
  <IsRespectRobotsDotTextEnabled>false</IsRespectRobotsDotTextEnabled>
  <IsRespectMetaRobotsNoFollowEnabled>false</IsRespectMetaRobotsNoFollowEnabled>
  <IsRespectHttpXRobotsTagHeaderNoFollowEnabled>false</IsRespectHttpXRobotsTagHeaderNoFollowEnabled>
  <IsRespectAnchorRelNoFollowEnabled>false</IsRespectAnchorRelNoFollowEnabled>
  <IsIgnoreRobotsDotTextIfRootDisallowedEnabled>false</IsIgnoreRobotsDotTextIfRootDisallowedEnabled>
  <RobotsDotTextUserAgentString>abot</RobotsDotTextUserAgentString>
  <MinCrawlDelayPerDomainMilliSeconds>0</MinCrawlDelayPerDomainMilliSeconds>
  <MaxRobotsDotTextCrawlDelayInSeconds>5</MaxRobotsDotTextCrawlDelayInSeconds>
  <IsAlwaysLogin>false</IsAlwaysLogin>
  <LoginUser></LoginUser>
  <LoginPassword></LoginPassword>
</CrawlConfiguration>
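Before pointing the Web Crawler at an edited Config File, it can help to check that the File is still well-formed XML and that the values are what you expect. The Python sketch below does this for a shortened copy of the configuration. It is not how Indigo DQM reads the File, and the `load_settings` helper and its type conversion rules are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example configuration, using the same element names.
CONFIG_XML = """<CrawlConfiguration>
  <MaxPagesToCrawl>6</MaxPagesToCrawl>
  <MaxCrawlDepth>100</MaxCrawlDepth>
  <HttpRequestTimeoutInSeconds>15</HttpRequestTimeoutInSeconds>
  <IsRespectRobotsDotTextEnabled>false</IsRespectRobotsDotTextEnabled>
  <LoginUser></LoginUser>
</CrawlConfiguration>"""

def load_settings(xml_text):
    """Parse the configuration and convert each value to bool, int or str."""
    settings = {}
    for element in ET.fromstring(xml_text):  # raises ParseError if malformed
        text = (element.text or "").strip()
        if text in ("true", "false"):
            settings[element.tag] = text == "true"
        elif text.isdigit():
            settings[element.tag] = int(text)
        else:
            settings[element.tag] = text
    return settings

settings = load_settings(CONFIG_XML)
for name, value in settings.items():
    print(f"{name} = {value!r}")
```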

Web Crawler Filter and Filter Files

The Web Crawler Filter can be used to include or exclude URLs in the Web Index.

Enter a Filter term and select the Filter action to either include or exclude any Web Pages in the Crawl containing the Filter term in the URL.

The Crawl Filter is a comma-separated list of terms to Filter. The File Filter is the location of a File containing a list of Filter terms, with each Filter term on a new line.

The Link Filter similarly includes or excludes Links in the Link Extract for each Web Page Crawled.
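The include / exclude behaviour described above can be sketched as a simple substring test on each URL. The Python below is an illustration only: the function names are hypothetical, and both the case-sensitive matching and the merging of Crawl Filter and File Filter terms into one list are assumptions rather than documented behaviour.

```python
def parse_filter_terms(crawl_filter="", filter_file_lines=()):
    """Combine a comma-separated filter string with one-term-per-line file content."""
    terms = [term.strip() for term in crawl_filter.split(",") if term.strip()]
    terms += [line.strip() for line in filter_file_lines if line.strip()]
    return terms

def apply_filter(urls, terms, action="include"):
    """Keep (include) or drop (exclude) URLs containing any filter term."""
    def matches(url):
        return any(term in url for term in terms)

    if action == "include":
        return [url for url in urls if matches(url)]
    if action == "exclude":
        return [url for url in urls if not matches(url)]
    raise ValueError(f"unknown filter action: {action}")

URLS = [
    "http://example.com/products/widget",
    "http://example.com/blog/news",
    "http://example.com/contact",
]
TERMS = parse_filter_terms("products, contact", ["blog\n"])

print(apply_filter(URLS, ["products"], "include"))
print(apply_filter(URLS, ["blog", "contact"], "exclude"))
```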

Connect to the Web Page HTML Data Source

Connect to the HTML Web Page using the XQuery Designer to view it as XML Data for Scraping, Query and Data Extraction.

Extracting the Web Page Title using XQuery

Executing an XQuery statement for this Web Scrape to Extract the Web Page Title.

Executing an XQuery statement for this Web Scrape to Extract the Web Page Keywords.
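For readers who want to try the same two extractions outside the XQuery Designer, the Python sketch below pulls the title and keywords from a small XML document using path expressions. The sample page is hypothetical, and the XML that Indigo DQM produces from a real Web Page may be structured differently (for example, it may use namespaces), so the paths are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical Web Page after conversion to well-formed XML.
PAGE_XML = """<html>
  <head>
    <title>Example Page</title>
    <meta name="keywords" content="data quality, web scraping" />
  </head>
  <body />
</html>"""

root = ET.fromstring(PAGE_XML)

# Roughly equivalent to the XPath /html/head/title/text()
title = root.findtext("head/title")

# Roughly equivalent to the XPath /html/head/meta[@name='keywords']/@content
keywords_node = root.find("head/meta[@name='keywords']")
keywords = keywords_node.get("content") if keywords_node is not None else None

print("Title:", title)
print("Keywords:", keywords)
```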

Indigo DRS Report Designer for advanced Web Scraping and Reporting

The Indigo DRS Report Designer can be used to create advanced Reports and Outputs from Data Scraped from Web Pages.

Using the Web Page XML Data processed by Indigo DQM, the Indigo DRS Reporting Engine can create Reports for business intelligence on competitors' Websites. This includes web indexing, data mining, Email and Link extraction, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups, and web data integration.