
[Hadoop] 02 Building a Large-Scale Distributed Data Processing System with hBase

2010. 3. 5. 12:46

2. Overview of Hadoop & hBase

 A common characteristic of the infrastructure software underlying Google's technology is that it is highly scalable and resilient to hardware failures. Google needed the capacity to collect and process data from all over the world, so technology that can operate large fleets of hardware while storing data and performing computation on it was, in a sense, an inevitable product of that need.

 

 Hadoop and hBase, introduced in this article, are open-source clones of this infrastructure software that Google developed in-house. Hadoop corresponds to GFS (the Google File System) and MapReduce, and hBase corresponds to BigTable. As already mentioned, the open-source counterpart of Chubby has yet to see real-world deployment.

 

 Web applications are now in their heyday, and quite a few sites are accumulating far more data and logs than before. However, even after amassing such huge volumes of data, processing it effectively and putting it to good use is not easy.

 

 For that reason, this article applies Hadoop and hBase, open-source clones of Google's in-house infrastructure software, in practice and verifies their effectiveness, in the hope that they will be introduced to a wider audience and become an easily approachable alternative for extracting useful information from the flood of dormant data. Since they can be deployed and tried on a single machine, practitioners in the field are encouraged to give them a try at least once.

 

 Now, let us move on to an overview of Hadoop and hBase.

 

 

 

2.1 Hadoop Overview

 Hadoop is open-source software whose development is driven primarily by Doug Cutting of Yahoo! Inc. It is now one of the Apache Software Foundation's top-level projects. The name Hadoop comes from the name of a yellow stuffed elephant that belongs to Doug's child.

 

 Hadoop consists of HDFS (the Hadoop Distributed File System) and the Hadoop MapReduce framework. Mapped onto Google's in-house technology, HDFS corresponds to GFS and the Hadoop MapReduce framework corresponds to MapReduce.
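
 To make the HDFS side of that picture concrete, here is a minimal sketch of writing a file to HDFS and reading it back through Hadoop's Java FileSystem API. It assumes a running HDFS instance whose address (fs.default.name) is supplied by a core-site.xml on the classpath; the path /tmp/hello.txt is purely illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.default.name (e.g. hdfs://localhost:9000) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS (path is hypothetical, chosen for the example).
        Path path = new Path("/tmp/hello.txt");
        FSDataOutputStream out = fs.create(path, true);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Read it back line by line.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
        fs.close();
    }
}
```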

 

 Hadoop is written entirely in Java, and MapReduce jobs are likewise expected to be written in Java by default. However, with an extension package called Hadoop Streaming, MapReduce jobs can be written in any language, such as C/C++, Ruby, or Python, using standard input and output.
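
 As an illustration of the default Java route, the sketch below is the classic word-count job written against the org.apache.hadoop.mapreduce API of the Hadoop 0.20 series current at the time of writing. The input and output HDFS paths passed on the command line are assumptions for the example.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

 With Hadoop Streaming, the same map and reduce steps could instead be supplied as external programs that read records from standard input and emit key/value pairs on standard output, which is what makes C/C++, Ruby, or Python usable.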

 

 Besides Yahoo!, which leads its development, Facebook also runs Hadoop in production; deployments aimed at log analysis are especially prominent.

 

 Hadoop-related sites

(There is currently very little information available in Korean.)

 

Hadoop Official Site

Hadoop Official Wiki

Installation guide (multi-node)

Hadoop DFS goals & features

Hadoop DFS Architecture

Hadoop MapReduce Tutorial

 

 

2.2 hBase Overview

 hBase is an open-source clone of BigTable. Its development is led by engineers at Powerset, a company building a web search engine, and like Hadoop it is written in Java. hBase depends on Hadoop, and its actual data is stored on HDFS. Earlier versions of the source code were bundled with Hadoop itself, but the two projects have since diverged, so take care to pick up the right code.
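
 To give a feel for the client side, here is a minimal sketch of writing and reading a single cell through the hBase Java client API of the 0.20 series current at the time of writing. The table name weblog, the column family stats, and the row key are hypothetical, the table is assumed to already exist, and connection settings are taken from an hbase-site.xml on the classpath.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath.
        HBaseConfiguration conf = new HBaseConfiguration();

        // Assumes a table 'weblog' with a column family 'stats' has already been created.
        HTable table = new HTable(conf, "weblog");

        // Write one cell: row key, column family:qualifier, value.
        Put put = new Put(Bytes.toBytes("row-20100305"));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("pageviews"), Bytes.toBytes("12345"));
        table.put(put);

        // Read the same cell back.
        Get get = new Get(Bytes.toBytes("row-20100305"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("pageviews"));
        System.out.println("pageviews = " + Bytes.toString(value));

        table.close();
    }
}
```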

 

 

 

hBase Official Site

hBase Official Wiki

hBase installation guide

hBase architecture & usage

 

 

 

 

 

Below is the current list of sites that have adopted and deployed Hadoop (from the Apache PoweredBy page).

 

Applications and organizations using Hadoop include (alphabetically):

  • A9.com - Amazon

    • We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
    • We process millions of sessions daily for analytics, using both the Java and streaming APIs.
    • Our clusters vary from 1 to 100 nodes.
  • Adobe

    • We use Hadoop and HBase in several areas from social services to structured data storage and processing for internal use.
    • We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes on both production and development. We plan a deployment on an 80-node cluster.
    • We constantly write data to HBase and run MapReduce jobs to process it, then store it back to HBase or to external systems.

    • Our production cluster has been running since Oct 2008.
  • Able Grape - Vertical search engine for trustworthy wine information

    • We have one of the world's smaller hadoop clusters (2 nodes @ 8 CPUs/node)
    • Hadoop and Nutch used to analyze and index textual information
  • Adknowledge - Ad network

    • Hadoop used to build the recommender system for behavioral targeting, plus other clickstream analytics
    • We handle 500MM clickstream events per day
    • Our clusters vary from 50 to 200 nodes, mostly on EC2.
    • Investigating use of R clusters atop Hadoop for statistical analysis and modeling at scale.
  • Alibaba

    • A 15-node cluster dedicated to processing all sorts of business data dumped out of databases and joining them together. These data are then fed into iSearch, our vertical search engine.
    • Each node has 8 cores, 16G RAM and 1.4T storage.
  • Amazon Web Services

    • We provide Amazon Elastic MapReduce. It's a web service that provides a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

    • Our customers can instantly provision as much or as little capacity as they like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
  • AOL

    • We use hadoop for a variety of things, ranging from ETL-style processing and statistics generation to running advanced algorithms for behavioral analysis and targeting.
    • Our cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk giving us a total of 37 TB HDFS capacity.
  • Atbrox

    • We use hadoop for information extraction & search, and data analysis consulting

    • Cluster: we primarily use Amazon's Elastic Mapreduce
  • BabaCar

    • 4 nodes cluster (32 cores, 1TB).
    • We use Hadoop for searching and analysis of millions of rental bookings.
  • backdocsearch.com - search engine for chiropractic information, local chiropractors, products and schools

  • Baidu - the leading Chinese language search engine

    • Hadoop is used to analyze search logs and do some mining work on the web page database
    • We handle about 3000TB per week
    • Our clusters vary from 10 to 500 nodes
    • Hypertable is also supported by Baidu
  • Beebler

    • 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • We use hadoop for matching dating profiles
  • Bixo Labs - Elastic web mining

    • The Bixolabs elastic web mining platform uses Hadoop + Cascading to quickly build scalable web mining applications.
    • We're doing a 200M page/5TB crawl as part of the public terabyte dataset project.

    • This runs as a 20 machine Elastic MapReduce cluster.

  • BrainPad - Data mining and analysis

    • We use Hadoop to summarize users' tracking data.
    • We also use it for analysis.
  • Cascading - Cascading is a feature rich API for defining and executing complex and fault tolerant data processing workflows on a Hadoop cluster.

  • Cloudera, Inc - Cloudera provides commercial support and professional training for Hadoop.

  • Contextweb - ADSDAQ Ad Exchange

    • We use Hadoop to store ad serving logs and use them as a source for ad optimization/analytics/reporting/machine learning.
    • Currently we have a 23 machine cluster with 184 cores and about 35TB raw storage. Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of storage.
  • Cooliris - Cooliris transforms your browser into a lightning fast, cinematic way to browse photos and videos, both online and on your hard drive.

    • We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB ram, and 3-4 TB of storage.
    • We use Hadoop for all of our analytics, and we use Pig to allow PMs and non-engineers the freedom to query the data in an ad-hoc manner.
  • Cornell University Web Lab

    • Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)
  • Deepdyve

    • Elastic cluster with 5-80 nodes
    • We use hadoop to create our indexes of deep web content and to provide a high availability and high bandwidth storage service for index shards for our search cluster.
  • Detikcom - Indonesia's largest news portal

    • We use hadoop, pig and hbase to analyze search logs, generate Most View News, generate top word clouds, and analyze all of our logs
    • Currently we use 9 nodes
  • DropFire

    • We generate Pig Latin scripts that describe structural and semantic conversions between data contexts
    • We use Hadoop to execute these scripts for production-level deployments
    • Eliminates the need for explicit data and schema mappings during database integration
  • Enormo

    • 4 nodes cluster (32 cores, 1TB).
    • We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
    • We plan to use Pig very shortly to produce statistics.
  • ESPOL University (Escuela Superior Politécnica del Litoral) in Guayaquil, Ecuador

    • 4 nodes proof-of-concept cluster.
    • We use Hadoop in a Data-Intensive Computing capstone course. The course projects cover topics like information retrieval, machine learning, social network analysis, business intelligence, and network security.
    • The students use on-demand clusters launched using Amazon's EC2 and EMR services, thanks to its AWS in Education program.
  • ETH Zurich Systems Group

    • We are using Hadoop in a course that we are currently teaching: "Massively Parallel Data Analysis with MapReduce". The course projects are based on real use-cases from biological data analysis.

    • Cluster hardware: 16 x (Quad-core Intel Xeon, 8GB RAM, 1.5 TB Hard-Disk)
  • Eyealike - Visual Media Search Platform

    • Facial similarity and recognition across large datasets.
    • Image content based advertising and auto-tagging for social media.
    • Image based video copyright protection.
  • Facebook

    • We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
    • Currently we have 2 major clusters:
      • A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
      • A 300-machine cluster with 2400 cores and about 3 PB raw storage.
      • Each (commodity) node has 8 cores and 12 TB of storage.
    • We are heavy users of both the streaming and the Java APIs. We have built a higher-level data warehousing framework called Hive using these features (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.

  • FOX Audience Network

    • 40 machine cluster (8 cores/machine, 2TB/machine storage)
    • 70 machine cluster (8 cores/machine, 3TB/machine storage)
    • 30 machine cluster (8 cores/machine, 4TB/machine storage)
    • Use for log analysis, data mining and machine learning
  • Freestylers - Image retrieval engine

    • We, the Japanese company Freestylers, have been using Hadoop since April 2009 to build the image-processing environment for an image-based product recommendation system, mainly on Amazon EC2.
    • Our Hadoop environment produces the original database for fast access from our web application.
    • We also use Hadoop to analyze similarities in user behavior.
  • Google

  • Gruter. Corp.

    • 30 machine cluster (4 cores, 1TB~2TB/machine storage)
    • storage for blog data and web documents
    • used for data indexing by MapReduce

    • link analyzing and Machine Learning by MapReduce

  • Hadoop Korean User Group, a Korean Local Community Team Page.

    • 50 node cluster in the Korea university network environment.
      • Pentium 4 PC, HDFS 4TB Storage
    • Used for development projects
      • Retrieving and Analyzing Biomedical Knowledge
      • Latent Semantic Analysis, Collaborative Filtering
  • Hulu

    • 13 machine cluster (8 cores/machine, 4TB/machine)
    • Log storage and analysis
    • Hbase hosting
  • Hadoop Taiwan User Group

  • Hipotecas y euribor

    • Evolution of the Euribor and its current value
    • Mortgage simulator for the economic crisis
  • Hosting Habitat

    • We use a customised version of Hadoop and Nutch in a currently experimental 6 node/Dual Core cluster environment.
    • We crawl our clients' websites, and from the information we gather we fingerprint old, un-updated software packages in that shared hosting environment. After matching a signature against a database, we can then inform our clients that they are running outdated software. With that information we know which sites require patching, offered as a free courtesy service to protect the majority of users. Without the technologies of Nutch and Hadoop this would be a far harder task to accomplish.
  • IBM

  • ICCS

    • We are using Hadoop and Nutch to crawl Blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning.
  • IIIT, Hyderabad

    • We use hadoop for Information Retrieval and Extraction research projects. Also working on map-reduce scheduling research for multi-job environments.
    • Our cluster sizes vary from 10 to 30 nodes, depending on the jobs. Heterogeneous nodes with most being Quad 6600s, 4GB RAM and 1TB disk per node. Also some nodes with dual core and single core configurations.
  • ImageShack

    • From TechCrunch:

      • Rather than put ads in or around the images it hosts, Levin is working on harnessing all the data his service generates about content consumption (perhaps to better target advertising on ImageShack or to syndicate that targeting data to ad networks). Like Google and Yahoo, he is deploying the open-source Hadoop software to create a massive distributed supercomputer, but he is using it to analyze all the data he is collecting.

  • Information Sciences Institute (ISI)

  • Iterend

    • Using a 10 node HDFS cluster to store and process retrieved data.
  • Joost

    • Session analysis and report generation
  • Journey Dynamics

    • Using Hadoop MapReduce to analyse billions of lines of GPS data to create TrafficSpeeds, our accurate traffic speed forecast product.

  • Karmasphere

    • Distributes Karmasphere Studio for Hadoop, which allows cross-version development and management of Hadoop jobs in a familiar integrated development environment.

  • Katta - Katta serves large Lucene indexes in a grid environment.

  • Koubei.com - Large local community and local search in China.

    • Using Hadoop to process Apache logs, analyzing users' actions and click flow, the links clicked to reach any specified page on the site, and more. Also using Hadoop with map/reduce to process all of the price data entered by users.
  • Krugle

    • Source code search engine uses Hadoop and Nutch.
  • Last.fm

    • 31 nodes
    • Dual quad-core Xeon L5520 (Nehalem) @ 2.27GHz, 16GB RAM, 4TB/node storage.
    • Used for charts calculation, log analysis, A/B testing
  • LinkedIn

    • 3x30 Nehalem-based node grids, with 2x4 cores, 16GB RAM, 8x1TB storage using ZFS in a JBOD configuration.
    • We use Hadoop and Pig for discovering People You May Know and other fun facts.
  • Lookery

    • We use Hadoop to process clickstream and demographic data in order to create web analytic reports.
    • Our cluster runs across Amazon's EC2 webservice and makes use of the streaming module to use Python for most operations.
  • Lotame

    • Using Hadoop and Hbase for storage, log analysis, and pattern discovery/analysis.
  • Makara

    • Using ZooKeeper on a 2 node cluster on VMware Workstation, Amazon EC2, and Xen

    • Using zkpython
    • Looking into expanding into 100 node cluster
  • MicroCode

    • 18 node cluster (Quad-Core Intel Xeon, 1TB/node storage)
    • Financial data for search and aggregation
    • Customer Relation Management data for search and aggregation
  • Media 6 Degrees

    • 20 node cluster (dual quad cores, 16GB, 6TB)
    • Used for log processing, data analysis and machine learning.
    • Focus is on social graph analysis and ad optimization.
    • Use a mix of Java, Pig and Hive.
  • MyLife

    • 18 node cluster (Quad-Core AMD Opteron 2347, 1TB/node storage)
    • Powers data for search and aggregation
  • Mahout

    • Another Apache project using Hadoop to build scalable machine learning algorithms like canopy clustering, k-means and many more to come (naive bayes classifiers, others)
  • MetrixCloud - provides commercial support, installation, and hosting of Hadoop Clusters. Contact Us.

  • Neptune

    • Another Bigtable cloning project, using Hadoop to store large structured data sets.
    • 200 nodes(each node has: 2 dual core CPUs, 2TB storage, 4GB RAM)
  • NetSeer -

    • Up to 1000 instances on Amazon EC2

    • Data storage in Amazon S3

    • 50 node cluster in Coloc
    • Used for crawling, processing, serving and log analysis
  • The New York Times

  • Ning

    • We use Hadoop to store and process our log files
    • We rely on Apache Pig for reporting, analytics, Cascading for machine learning, and on a proprietary JavaScript API for ad-hoc queries

    • We use commodity hardware, with 8 cores and 16 GB of RAM per machine
  • Nutch - flexible web search engine software

  • PARC - Used Hadoop to analyze Wikipedia conflicts paper.

  • Powerset / Microsoft - Natural Language Search

  • Pressflip - Personalized Persistent Search

    • Using Hadoop on EC2 to process documents from a continuous web crawl and distributed training of support vector machines
    • Using HDFS for large archival data storage
  • PSG Tech, Coimbatore, India

    • Multiple alignment of protein sequences helps to determine evolutionary linkages and to predict molecular structures. The dynamic nature of the algorithm coupled with data and compute parallelism of hadoop data grids improves the accuracy and speed of sequence alignment. Parallelism at the sequence and block level reduces the time complexity of MSA problems. Scalable nature of Hadoop makes it apt to solve large scale alignment problems.
    • Our cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 Quad Core Rack Server, with 2x6MB Cache and 4 x 500 GB SATA Hard Drive to E7200 / E7400 processors with 4 GB RAM and 160 GB HDD.
  • Quantcast

    • 3000 cores, 3500TB. 1PB+ processing each day.
    • Hadoop scheduler with fully custom data path / sorter
    • Significant contributions to KFS filesystem
  • Rackspace

    • 30 node cluster (Dual-Core, 4-8GB RAM, 1.5TB/node storage)
  • Rapleaf

    • 80 node cluster (each node has: 2 quad core CPUs, 4TB storage, 16GB RAM)
    • We use hadoop to process data relating to people on the web
    • We are also involved with Cascading to help simplify how our data flows through various processing stages
  • Redpoll

    • Hardware: 35 nodes (2*4cpu 10TB disk 16GB RAM each)
    • We intend to parallelize some traditional classification and clustering algorithms, like Naive Bayes, K-Means, and EM, so that they can deal with large-scale data sets.
  • Search Wikia

    • A project to help develop open source social search tools. We run a 125 node hadoop cluster.
  • SEDNS - Security Enhanced DNS Group

    • We are gathering worldwide DNS data in order to discover content distribution networks and configuration issues, utilizing Hadoop DFS and MapRed.

  • Socialmedia.com

    • 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • We use hadoop to process log data and perform on-demand analytics
  • Spadac.com

    • We are developing the MrGeo (Map/Reduce Geospatial) application to allow our users to bring cloud computing to geospatial processing.

    • We use HDFS and MapReduce to store, process, and index geospatial imagery and vector data.

    • MrGeo is soon to be open sourced as well.

  • Stampede Data Solutions (Stampedehost.com)

    • Hosted Hadoop data warehouse solution provider
  • Taragana - Web 2.0 Product development and outsourcing services

    • We are using 16 consumer-grade computers to create the cluster, connected by a 100 Mbps network.
    • Used for testing ideas for blog and other data mining.
  • The Lydia News Analysis Project - Stony Brook University

    • We are using Hadoop on 17-node and 103-node clusters of dual-core nodes to process and extract statistics from over 1000 U.S. daily newspapers as well as historical archives of the New York Times and other sources.
  • Tailsweep - Ad network for blogs and social media

    • 8 node cluster (Xeon Quad Core 2.4GHz, 8GB RAM, 500GB/node Raid 1 storage)
    • Used as a proof of concept cluster
    • Handling e.g. data mining and blog crawling
  • Technical analysis and Stock Research

    • Generating stock analysis on 23 nodes (dual 2.4GHz Xeon, 2 GB RAM, 36GB Hard Drive)
  • Telefonica Research

    • We use Hadoop in our data mining and user modeling, multimedia, and internet research groups.
    • 6 node cluster with 96 total cores, 8GB RAM and 2 TB storage per machine.
  • University of Glasgow - Terrier Team

    • 30 nodes cluster (Xeon Quad Core 2.4GHz, 4GB RAM, 1TB/node storage).

      We use Hadoop to facilitate information retrieval research & experimentation, particularly for TREC, using the Terrier IR platform. The open source release of Terrier includes large-scale distributed indexing using Hadoop Map Reduce.

  • University of Maryland

    • We are one of six universities participating in IBM/Google's academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.
  • University of Nebraska Lincoln, Research Computing Facility

    • We currently run one medium-sized Hadoop cluster (200TB) to store and serve up physics data for the computing portion of the Compact Muon Solenoid (CMS) experiment. This requires a filesystem which can download data at multiple Gbps and process data at an even higher rate locally. Additionally, several of our students are involved in research projects on Hadoop.
  • Veoh

    • We use a small Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data.
  • Visible Measures Corporation uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.

  • VK Solutions

    • We use a small Hadoop cluster in the scope of our general research activities at VK Labs to get a faster data access from web applications.

    • We also use Hadoop for filtering and indexing listings, processing log analysis, and for recommendation data.
  • Vuelos baratos

    • We use a small Hadoop cluster
  • WorldLingo

    • Hardware: 44 servers (each server has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • Each server runs Xen with one Hadoop/HBase instance and another instance with web or application servers, giving us 88 usable virtual machines.
    • We run two separate Hadoop/HBase clusters with 22 nodes each.
    • Hadoop is primarily used to run HBase and Map/Reduce jobs scanning over the HBase tables to perform specific tasks.
    • HBase is used as a scalable and fast storage back end for millions of documents.
    • Currently we store 12 million documents, with a target of 450 million in the near future.
  • Yahoo!

    • More than 100,000 CPUs in >25,000 computers running Hadoop

    • Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)

      • Used to support research for Ad Systems and Web Search
      • Also used to do scaling tests to support development of Hadoop on larger clusters
    • Our Blog - Learn more about how we use Hadoop.

    • >40% of Hadoop Jobs within Yahoo are Pig jobs.

  • Zvents

    • 10 node cluster (Dual-Core AMD Opteron 2210, 4GB RAM, 1TB/node storage)
    • Run Naive Bayes classifiers in parallel over crawl data to discover event information


PoweredBy (last modified by voyager on 2010-02-25 18:49:01)