Apache Nutch

Highly extensible, highly scalable Web crawler 1

Nutch is a mature, production-ready Web crawler. Nutch 1.x enables fine-grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.

1. Features 1

  1. Transparency Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments, rather than on the relevance of the site’s contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
  2. Understanding We don’t have the source code to Google, so Nutch is probably the best we have. It’s interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model, which emerged from Google Labs. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
  3. Extensibility Don’t like the way other search engines display their results? Write your own search engine using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, with the ability to customize the search interface or extend the out-of-the-box functionality through the plugin mechanism. For example, you can integrate it into your site to add a search capability.

2. Architecture 2 3

Data Structures 4

The web database is a specialized persistent data structure for mirroring the structure and properties of the web graph being crawled. It persists as long as the web graph that is being crawled (and re-crawled) exists, which may be months or years. The WebDB is used only by the crawler and does not play any role during searching. The WebDB stores two types of entities: pages and links.

A page represents a page on the Web, and is indexed by its URL and the MD5 hash of its contents. Other pertinent information is stored, too, including

  • the number of links in the page (also called outlinks);
  • fetch information (such as when the page is due to be refetched);
  • the page’s score, which is a measure of how important the page is (for example, one measure of importance awards high scores to pages that are linked to from many other pages).

A link represents a link from one web page (the source) to another (the target). In the WebDB web graph, the nodes are pages and the edges are links.

A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for the segment is indexed and the index is stored in the segment. Any given segment has a limited lifespan, since it is obsolete as soon as all of its pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually a good idea to delete segments older than this, particularly as they take up so much disk space. Segments are named by the date and time they were created, so it’s easy to tell how old they are.
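
To make this concrete, in Nutch 1.x a segment is simply a timestamped directory under the crawl directory. The sub-directory names below are the standard ones; the path and timestamp are only illustrative (and in later 1.x releases the index itself lives outside the segment, e.g. in Solr or a separate index directory):

crawl/segments/20160206103945/
  crawl_generate/   # the fetchlist generated from the CrawlDB
  crawl_fetch/      # the status of each fetch attempt
  content/          # the raw content retrieved from each URL
  parse_text/       # the extracted plain text of each page
  parse_data/       # outlinks and metadata extracted during parsing
  crawl_parse/      # outlink URLs used to update the CrawlDB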

The index is the inverted index of all of the pages the system has retrieved, and is created by merging all of the individual segment indexes.

Nutch uses Lucene for its indexing, so all of the Lucene tools and APIs are available to interact with the generated index. Since this has the potential to cause confusion, it is worth mentioning that the Lucene index format has a concept of segments, too, and these are different from Nutch segments. A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched and indexed portion of the WebDB.

See conf/gora-hbase-mapping.xml for more details on how Nutch 2.x maps these structures onto HBase.
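
In Nutch 2.x these structures live in a single HBase table (named webpage by default), and the layout is declared in the Gora mapping file. The excerpt below is only an abridged sketch of what that file looks like; treat the families, qualifiers, and field list in your own conf/gora-hbase-mapping.xml as authoritative:

<gora-orm>
  <table name="webpage">
    <family name="f"/>   <!-- fetch-related columns -->
    <family name="p"/>   <!-- parse-related columns -->
    <family name="s"/>   <!-- scoring -->
    <family name="il"/>  <!-- inlinks -->
    <family name="ol"/>  <!-- outlinks -->
  </table>
  <class table="webpage" keyClass="java.lang.String"
         name="org.apache.nutch.storage.WebPage">
    <field name="baseUrl"  family="f" qualifier="bas"/>
    <field name="status"   family="f" qualifier="st"/>
    <field name="content"  family="f" qualifier="cnt"/>
    <field name="title"    family="p" qualifier="t"/>
    <field name="text"     family="p" qualifier="c"/>
    <field name="score"    family="s" qualifier="s"/>
    <field name="inlinks"  family="il"/>
    <field name="outlinks" family="ol"/>
  </class>
</gora-orm>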

Process 5

0. Initialize the CrawlDB and inject the seed URLs.

Repeat the generate-fetch-update cycle n times:

1. The Injector takes all the URLs from the seed file (nutch.txt in this example) and adds them to the CrawlDB. As a central part of Nutch, the CrawlDB maintains information on all known URLs (fetch schedule, fetch status, metadata, …).
2. Based on the data of CrawlDB, the Generator creates a fetchlist and places it in a newly created Segment directory.
3. Next, the Fetcher gets the content of the URLs on the fetchlist and writes it back to the Segment directory. This step usually is the most time-consuming one.
4. Now the Parser processes the content of each web page, for example stripping out all HTML tags. If the crawl is an update or an extension of an already existing one (e.g. with a depth of 3), the Updater adds the new data to the CrawlDB as a next step.
5. Before indexing, all the links need to be inverted by the Link Inverter, which accounts for the fact that it is not the number of outgoing links of a page that matters, but rather the number of inbound links. This is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are saved in the LinkDB.
6-7. Using data from all possible sources (CrawlDB, LinkDB and Segments), the Indexer creates an index and saves it within the Solr directory. For indexing, the popular Lucene library is used. The user can then search for information regarding the crawled web pages via Solr. (A command-line sketch of this whole cycle follows below.)
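
As a rough sketch, these stages map onto the Nutch 1.x command-line tools as shown below. The directory names, -topN value and Solr URL are illustrative only, and the Nutch 2.x commands used later in this guide differ slightly:

$ bin/nutch inject crawl/crawldb seed/                          # 1. Injector
$ bin/nutch generate crawl/crawldb crawl/segments -topN 1000    # 2. Generator
$ s=`ls -d crawl/segments/2* | tail -1`                         # newest segment
$ bin/nutch fetch $s                                            # 3. Fetcher
$ bin/nutch parse $s                                            # 4. Parser
$ bin/nutch updatedb crawl/crawldb $s                           # 4. Updater
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments        # 5. Link Inverter
$ bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s   # 6-7. Indexer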

3. Install Nutch 6

Requirements

1. OpenJDK 7 & ant

2. Nutch 2.3 RC (yes, you need 2.3, 2.2 will not work)

wget http://www.eu.apache.org/dist/nutch/2.3/apache-nutch-2.3-src.tar.gz
tar -xzf apache-nutch-2.3-src.tar.gz 

3. HBase 0.94.27 (HBase 0.98 won’t work)

wget https://www.apache.org/dist/hbase/hbase-0.94.27/hbase-0.94.27.tar.gz
tar -xzf hbase-0.94.27.tar.gz

4. ElasticSearch 1.7

# URL pattern assumed from the standard ElasticSearch 1.x download location; adjust the 1.7.x version as needed
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.tar.gz
tar -xzf elasticsearch-1.7.1.tar.gz

Other Options: nutch-2.3, hbase-0.94.26, ElasticSearch 1.4

Setup HBase

1. edit $HBASE_ROOT/conf/hbase-site.xml and add

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

2. edit $HBASE_ROOT/conf/hbase-env.sh and enable JAVA_HOME and set it to the proper path:

-# export JAVA_HOME=/usr/java/jdk1.6.0/
+export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

This step might seem redundant, but even with JAVA_HOME being set in my shell, HBase just didn’t recognize it.

3. kick off HBase:

$HBASE_ROOT/bin/start-hbase.sh
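
To check that HBase came up, you can (optionally) open the HBase shell and list the tables; the webpage table used by Nutch will only show up later, once Nutch has written to the store for the first time:

$HBASE_ROOT/bin/hbase shell
hbase(main):001:0> list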

4. Setting up Nutch

1. enable the HBase dependency in $NUTCH_ROOT/ivy/ivy.xml by uncommenting the line

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

2. configure the HBase adapter by editing the $NUTCH_ROOT/conf/gora.properties:

-#gora.datastore.default=org.apache.gora.mock.store.MockDataStore
+gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

3. build Nutch

$ cd $NUTCH_ROOT
$ ant clean
$ ant runtime

This can take a while and creates $NUTCH_ROOT/runtime/local.
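
As a quick sanity check of the build, running the launcher script without arguments should print the list of available commands:

$NUTCH_ROOT/runtime/local/bin/nutch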

4. configure Nutch by editing $NUTCH_ROOT/runtime/local/conf/nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value> <!-- do not leave the seeded domains (optional) -->
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value> <!-- where is ElasticSearch listening -->
  </property>
</configuration>

5. configure HBase integration by editing $NUTCH_ROOT/runtime/local/conf/hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value> <!-- same path as you've given for HBase above -->
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

That’s it. Everything is now set up to crawl websites.

5. Use Nutch

5.1 Adding new Domains to crawl with Nutch
  1. Create an empty directory and add a text file containing a list of seed URLs.
    $ mkdir seed
    $ echo "https://www.website.com" >> seed/urls.txt
    $ echo "https://www.another.com" >> seed/urls.txt
    $ echo "https://www.example.com" >> seed/urls.txt
    
  2. Inject them into Nutch by passing a file:// URL to the seed directory (note: a file URL, not a plain path):
    $ $NUTCH_ROOT/runtime/local/bin/nutch inject file:///path/to/seed/
    
5.2 Actual Crawling Procedure
  1. Generate a new set of URLs to fetch. This is based on both the injected URLs and outdated URLs already in the Nutch CrawlDB.
    $ $NUTCH_ROOT/runtime/local/bin/nutch generate -topN 10
    

    The above command will create job batches for 10 URLs.

  2. Fetch the URLs. We are not clustering, so we can simply fetch all batches:
    $ $NUTCH_ROOT/runtime/local/bin/nutch fetch -all
    
  3. Now we parse all fetched pages:
    $ $NUTCH_ROOT/runtime/local/bin/nutch parse -all
    
  4. Last step: Update Nutch’s internal database:
    $ $NUTCH_ROOT/runtime/local/bin/nutch updatedb -all
    

On the first run, this will only crawl the injected URLs. The procedure above is supposed to be repeated regularly to keep the index up to date; a small wrapper script for this is sketched below.
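
Since the four commands always run in the same order, a small wrapper script keeps the re-crawl convenient. This is only a sketch (it is not shipped with Nutch), and the number of rounds and the -topN value are arbitrary:

#!/bin/bash
NUTCH=$NUTCH_ROOT/runtime/local/bin/nutch

for round in 1 2 3; do        # number of crawl rounds is arbitrary
  $NUTCH generate -topN 10    # queue up to 10 URLs for this round
  $NUTCH fetch -all           # fetch all generated batches
  $NUTCH parse -all           # parse the fetched content
  $NUTCH updatedb -all        # feed the results back into the CrawlDB
done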

5.3 Putting Documents into ElasticSearch

Easy peasy:

$ $NUTCH_ROOT/runtime/local/bin/nutch index -all
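
This assumes an ElasticSearch node is already running on the host configured as elastic.host; for a local test you could start the unpacked 1.7 tarball in the background with bin/elasticsearch -d. To spot-check that documents actually arrived, query the REST API (this searches across all indices, so you don’t need to know which index name Nutch used):

$ curl 'http://localhost:9200/_search?pretty'
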
5.4 Configuration

Crawling via a proxy

Add the following properties inside the <configuration> element of $NUTCH_ROOT/runtime/local/conf/nutch-site.xml:

<property>
  <name>http.proxy.host</name>
  <value>192.168.80.1</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>port</value>
  <description>The proxy port.</description>
</property>
<property>
  <name>http.proxy.username</name>
  <value>username</value>
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>

<property>
  <name>http.proxy.password</name>
  <value>password</value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>

6. Nutch Plugins 7

6.1 Extension Points

In writing a plugin, you’re actually providing one or more extensions of the existing extension-points. The core Nutch extension-points are themselves defined in a plugin, the NutchExtensionPoints plugin (they are listed in the NutchExtensionPoints plugin.xml file). Each extension-point defines an interface that must be implemented by the extension.

The core extension points are:

IndexWriter: Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
IndexingFilter: Permits adding metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from the javadoc).
Parser: Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to parse a new type of content, or extract more data from currently parseable content.
HtmlParseFilter: Permits adding additional metadata to HTML parses (from the javadoc).
Protocol: Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
URLFilter: URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over which URLs Nutch crawls; however, if you have very complicated rules about which URLs you want to crawl, you can write your own implementation.
URLNormalizer: Interface used to convert URLs to a normal form and optionally perform substitutions.
ScoringFilter: A contract defining the behavior of scoring plugins. A scoring filter manipulates scoring variables in CrawlDatum and in the resulting search indexes. Filters can be chained in a specific order to provide multi-stage scoring adjustments.
SegmentMergeFilter: Interface used to filter segments during a segment merge. It allows filtering on more sophisticated criteria than just URLs; in particular, it allows filtering based on metadata collected while parsing the page.
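
To make the extension-point idea concrete, here is a minimal sketch of a URLFilter implementation. The class and package name are made up for illustration; the interface and its single filter method are the real extension point from the list above:

package org.example.nutch;  // illustrative package, not part of Nutch

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/** Drops every URL ending in ".pdf"; everything else passes through unchanged. */
public class NoPdfURLFilter implements URLFilter {

  private Configuration conf;

  @Override
  public String filter(String urlString) {
    // return the URL to keep it, or null to have Nutch discard it
    return urlString.endsWith(".pdf") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}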

6.2 Getting Nutch to Use a Plugin 7

In order to get Nutch to use a given plugin, you need to edit your conf/nutch-site.xml file and add the name of the plugin to the plugin.includes list. In addition, the plugin’s build configuration has to be added to the build.xml in the plugin directory so that it is built along with the rest of Nutch.
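
For example, if your plugin’s id were urlmeta (the example used in the directory layout below), the plugin.includes value from the nutch-site.xml shown earlier would grow by one alternative:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|urlmeta|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>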

6.3 Project structure of a plugin

plugin-name
  plugin.xml   # file that tells Nutch about the plugin.
  build.xml    # file that tells ant how to build the plugin.
  ivy.xml      # file that describes any dependencies required by the plugin.
  src
    org
      apache
        nutch
          indexer
            urlmeta # source folder
              URLMetaIndexingFilter.java
          scoring
            urlmeta # source folder
              URLMetaScoringFilter.java
  test
    org
      apache
        nutch
          indexer
            urlmeta # test folder
              URLMetaIndexingFilterTest.java
          scoring
            urlmeta # test folder
              URLMetaScoringFilterTest.java
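
For the layout above, the plugin.xml would look roughly like the sketch below: it declares the plugin’s jar, its dependency on the nutch-extensionpoints plugin, and one extension per implemented extension point. The ids and the jar name are illustrative:

<plugin id="urlmeta" name="URL Meta Plugin" version="1.0.0" provider-name="org.apache.nutch">
  <runtime>
    <library name="urlmeta.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.indexer.urlmeta" name="URL Meta Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="indexer-urlmeta"
                    class="org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter"/>
  </extension>
  <extension id="org.apache.nutch.scoring.urlmeta" name="URL Meta Scoring Filter"
             point="org.apache.nutch.scoring.ScoringFilter">
    <implementation id="scoring-urlmeta"
                    class="org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter"/>
  </extension>
</plugin>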
