Highly extensible, highly scalable Web crawler
Nutch is a mature, production-ready Web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are great for batch processing.
1. Features
- Transparency Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments, rather than on the relevance of the site’s contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
- Understanding We don’t have the source code to Google, so Nutch is probably the best we have. It’s interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model, which emerged from Google Labs. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
- Extensibility Don’t like the way other search engines display their results? Write your own search engine using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, with the ability to customize the search interface or extend the out-of-the-box functionality through the plugin mechanism. For example, you can integrate it into your site to add a search capability.
2. Data Structures
The web database is a specialized persistent data structure for mirroring the structure and properties of the web graph being crawled. It persists as long as the web graph that is being crawled (and re-crawled) exists, which may be months or years. The WebDB is used only by the crawler and does not play any role during searching. The WebDB stores two types of entities: pages and links.
A page represents a page on the Web, and is indexed by its URL and the MD5 hash of its contents. Other pertinent information is stored, too, including
- the number of links in the page (also called outlinks);
- fetch information (such as when the page is due to be refetched);
- the page’s score, which is a measure of how important the page is (for example, one measure of importance awards high scores to pages that are linked to from many other pages).
A link represents a link from one web page (the source) to another (the target). In the WebDB web graph, the nodes are pages and the edges are links.
A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for the segment is indexed and the index is stored in the segment. Any given segment has a limited lifespan, since it is obsolete as soon as all of its pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually a good idea to delete segments older than this, particularly as they take up so much disk space. Segments are named by the date and time they were created, so it’s easy to tell how old they are.
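Since segments are named by creation time, pruning obsolete ones is easy to script. A minimal sketch, assuming the segments live under a local `segments/` directory (the path and helper name are assumptions, not part of Nutch):

```shell
# List segment directories older than the default 30-day re-fetch interval,
# using the filesystem modification time as a proxy for segment age.
prune_candidates() {
  find "${1:-segments}" -mindepth 1 -maxdepth 1 -type d -mtime +30
}
# Review the output first; then delete with e.g.:
#   prune_candidates segments | xargs rm -r
```

The function only lists candidates; the actual deletion is left as an explicit second step so a mis-set path cannot silently destroy fresh segments.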
The index is the inverted index of all of the pages the system has retrieved, and is created by merging all of the individual segment indexes.
Nutch uses Lucene for its indexing, so all of the Lucene tools and APIs are available to interact with the generated index. Since this has the potential to cause confusion, it is worth mentioning that the Lucene index format has a concept of segments, too, and these are different from Nutch segments. A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched and indexed portion of the WebDB.
0. initialize the CrawlDB, inject the seed URLs
generate-fetch-update cycle, repeated n times:
1. The Injector takes all the URLs of the nutch.txt file and adds them to the CrawlDB. As a central part of Nutch, the CrawlDB maintains information on all known URLs (fetch schedule, fetch status, metadata, …).
2. Based on the data of the CrawlDB, the Generator creates a fetchlist and places it in a newly created segment directory.
3. Next, the Fetcher gets the content of the URLs on the fetchlist and writes it back to the segment directory. This step usually is the most time-consuming one.
4. Now the Parser processes the content of each web page and, for example, strips out all HTML tags. If the crawl functions as an update or an extension to an already existing one (e.g. a depth of 3), the Updater adds the new data to the CrawlDB as a next step.
5. Before indexing, all the links need to be inverted by the Link Inverter, which takes into account that it is not the number of outgoing links of a web page that is of interest, but rather the number of inbound links. This is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are saved in the LinkDB.
6-7. Using data from all possible sources (CrawlDB, LinkDB and segments), the Indexer creates an index and saves it within the Solr directory. For indexing, the popular Lucene library is used. Now, the user can search for information regarding the crawled web pages via Solr.
3. Install Nutch
1. OpenJDK 7 & ant
2. Nutch 2.3 RC (yes, you need 2.3, 2.2 will not work)
$ wget http://www.eu.apache.org/dist/nutch/2.3/apache-nutch-2.3-src.tar.gz
$ tar -xzf apache-nutch-2.3-src.tar.gz
3. HBase 0.94.27 (HBase 0.98 won’t work)
$ wget https://www.apache.org/dist/hbase/hbase-0.94.27/hbase-0.94.27.tar.gz
$ tar -xzf hbase-0.94.27.tar.gz
4. ElasticSearch 1.7
Download the ElasticSearch 1.7 tarball from the elastic.co download archive and unpack it.
Other Options: nutch-2.3, hbase-0.94.26, ElasticSearch 1.4
1. edit $HBASE_ROOT/conf/hbase-site.xml and add:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
2. edit $HBASE_ROOT/conf/hbase-env.sh, enable JAVA_HOME and set it to the proper path:
-# export JAVA_HOME=/usr/java/jdk1.6.0/
+export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
This step might seem redundant, but even with
JAVA_HOME being set in my shell, HBase just didn’t recognize it.
3. kick off HBase:
$ $HBASE_ROOT/bin/start-hbase.sh
4. Setting up Nutch
1. enable the HBase dependency in
$NUTCH_ROOT/ivy/ivy.xml by uncommenting the line
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
2. configure the HBase adapter by editing $NUTCH_ROOT/conf/gora.properties and making sure the HBase store is the default:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
3. build Nutch
$ cd $NUTCH_ROOT
$ ant clean
$ ant runtime
This can take a while and creates the Nutch runtime at $NUTCH_ROOT/runtime/.
4. configure Nutch by editing $NUTCH_ROOT/runtime/local/conf/nutch-site.xml:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value> <!-- do not leave the seeded domains (optional) -->
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value> <!-- where is ElasticSearch listening -->
  </property>
</configuration>
5. configure HBase integration by editing $NUTCH_ROOT/runtime/local/conf/hbase-site.xml:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value>
    <!-- same path as you've given for HBase above -->
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
That’s it. Everything is now set up to crawl websites.
5. Use Nutch
5.1 Adding new Domains to crawl with Nutch
- create an empty directory and add a text file containing a list of seed URLs:
$ mkdir seed
$ echo "https://www.website.com" >> seed/urls.txt
$ echo "https://www.another.com" >> seed/urls.txt
$ echo "https://www.example.com" >> seed/urls.txt
- inject them into Nutch by giving a file URL (!)
$ $NUTCH_ROOT/runtime/local/bin/nutch inject file:///path/to/seed/
5.2 Actual Crawling Procedure
- Generate a new set of URLs to fetch. This is based on both the injected URLs as well as outdated URLs in the Nutch crawl db.
$ $NUTCH_ROOT/runtime/local/bin/nutch generate -topN 10
The above command will create job batches for 10 URLs.
- Fetch the URLs. We are not clustering, so we can simply fetch all batches:
$ $NUTCH_ROOT/runtime/local/bin/nutch fetch -all
- Now we parse all fetched pages:
$ $NUTCH_ROOT/runtime/local/bin/nutch parse -all
- Last step: Update Nutch’s internal database:
$ $NUTCH_ROOT/runtime/local/bin/nutch updatedb -all
On the first run, this will only crawl the injected URLs. The procedure above is supposed to be repeated regularly to keep the index up to date.
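The four commands above can be wrapped in a small helper so repeated rounds are a one-liner. A sketch, assuming a built Nutch runtime under $NUTCH_ROOT (the crawl_round name is hypothetical, not part of Nutch):

```shell
# Run one generate/fetch/parse/updatedb round, mirroring the steps above.
# Assumes $NUTCH_ROOT points at the Nutch tree built with `ant runtime`.
crawl_round() {
  nutch="$NUTCH_ROOT/runtime/local/bin/nutch"
  "$nutch" generate -topN "${1:-10}" &&
  "$nutch" fetch -all &&
  "$nutch" parse -all &&
  "$nutch" updatedb -all
}
# e.g. three rounds of up to 10 URLs each:
#   for i in 1 2 3; do crawl_round 10; done
```

Chaining with `&&` stops a round early if any step fails, so a broken fetch does not feed stale data into updatedb.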
5.3 Putting Documents into ElasticSearch
$ $NUTCH_ROOT/runtime/local/bin/nutch index -all
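To spot-check that documents actually arrived, you can query ElasticSearch directly. A sketch of the search URL to hit; the index name "nutch" and the helper name are assumptions (the actual index is governed by Nutch's elastic.index property):

```shell
# Build the ElasticSearch 1.7 URI-search URL for a quick sanity check.
# ES_HOST defaults to localhost:9200; override it for a remote node.
es_search_url() {
  printf 'http://%s/nutch/_search?q=%s&pretty' "${ES_HOST:-localhost:9200}" "$1"
}
# With a running ES node:
#   curl -s "$(es_search_url 'content:example')"
```

A non-empty `hits` array in the response confirms the indexing step delivered documents.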
5.4 Crawling via a Proxy
To crawl through a proxy, add the following properties to the Nutch configuration (nutch-site.xml):
<property>
  <name>http.proxy.host</name>
  <value>192.168.80.1</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>port</value>
  <description>The proxy port.</description>
</property>
<property>
  <name>http.proxy.username</name>
  <value>username</value>
  <description>Username for proxy. This will be used by 'protocol-httpclient', if the proxy server requests basic, digest and/or NTLM authentication. To use this, 'protocol-httpclient' must be present in the value of 'plugin.includes' property. NOTE: For NTLM authentication, do not prefix the username with the domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.</description>
</property>
<property>
  <name>http.proxy.password</name>
  <value>password</value>
  <description>Password for proxy. This will be used by 'protocol-httpclient', if the proxy server requests basic, digest and/or NTLM authentication. To use this, 'protocol-httpclient' must be present in the value of 'plugin.includes' property.</description>
</property>
6. Nutch Plugins
6.1 Extension Points
In writing a plugin, you’re actually providing one or more extensions of the existing extension-points . The core Nutch extension-points are themselves defined in a plugin, the NutchExtensionPoints plugin (they are listed in the NutchExtensionPoints plugin.xml file). Each extension-point defines an interface that must be implemented by the extension.
The core extension points are:
|IndexWriter||Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).|
|IndexingFilter||Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).|
|Parser||Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.|
|HtmlParseFilter||Permits one to add additional metadata to HTML parses (from javadoc).|
|Protocol||Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.|
|URLFilter||URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls; however, if you have very complicated rules about which URLs you want to crawl, you can write your own implementation.|
|URLNormalizer||Interface used to convert URLs to normal form and optionally perform substitutions.|
|ScoringFilter||A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.|
|SegmentMergeFilter||Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular, it allows filtering based on metadata collected while parsing the page.|
6.2 Getting Nutch to Use a Plugin
In order to get Nutch to use a given plugin, you need to edit your conf/nutch-site.xml file and add the name of the plugin to the list in plugin.includes. Additionally, you need to add the plugin's build configuration to the build.xml in the plugin directory.
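For example, enabling a plugin means extending the regular expression in plugin.includes; here the appended "urlmeta" is an illustrative plugin name, and the rest of the value is one possible baseline, not a required set:

```xml
<property>
  <name>plugin.includes</name>
  <!-- the trailing "|urlmeta" enables the (hypothetical) urlmeta plugin -->
  <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic|urlmeta</value>
</property>
```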
6.3 Project structure of a plugin
plugin-name
  plugin.xml    # file that tells Nutch about the plugin.
  build.xml     # file that tells ant how to build the plugin.
  ivy.xml       # file that describes any dependencies required by the plugin.
  src
    org/apache/nutch
      indexer/uml-meta          # source folder
        URLMetaIndexingFilter.java
      scoring/uml-meta          # source folder
        URLMetaScoringFilter.java
  test
    org/apache/nutch
      indexer/uml-meta          # test folder
        URLMetaIndexingFilterTest.java
      scoring/uml-meta          # test folder
        URLMetaScoringFilterTest.java
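For illustration, a minimal plugin.xml descriptor might look like the following; the plugin id, jar name, and implementation class are hypothetical and must match your own plugin, but the extension-point id shown (org.apache.nutch.indexer.IndexingFilter) is the real interface name from the table above:

```xml
<plugin id="urlmeta-plugin" name="URL Meta Plugin" version="1.0.0"
        provider-name="example.org">
  <runtime>
    <!-- the jar produced by the plugin's build.xml -->
    <library name="urlmeta-plugin.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- register an extension against the IndexingFilter extension point -->
  <extension id="org.apache.nutch.indexer.urlmeta"
             name="URL Meta Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="URLMetaIndexingFilter"
                    class="org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter"/>
  </extension>
</plugin>
```

Nutch reads this descriptor at startup to discover which extension points the plugin implements; the class attribute must name a class on the plugin's classpath that implements the interface behind the named extension point.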