Apache Atlas in Practice: Worked Examples

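Getting Started with Apache Atlas

As described in the author's earlier articles, Apache Atlas is a data governance and metadata framework for Hadoop.

It is built on the Hadoop platform and integrates seamlessly with Hadoop components.

It uses Solr 5 as the default index backend and HBase as the graph storage backend, and it exposes a rich REST API.

It can import metadata from several kinds of sources, including Hive and HBase (whether traditional relational databases are supported is not yet clear to the author).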

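1. Installing Apache Atlas

The installation steps are on the official site:

https://atlas.apache.org/InstallationSteps.html

For convenience, here is a brief translation of the steps.

Environment:

JDK 8
Maven 3.x
Git
Python 2.7 or later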

(1) Building Atlas

    git clone https://git-wip-us.apache.org/repos/asf/atlas.git atlas
    cd atlas
    export MAVEN_OPTS="-Xms2g -Xmx4g"
    mvn clean -DskipTests install

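Note: the server needs at least 4 GB of memory. The author had to upgrade the machine's configuration several times.

[Screenshot of the author's build omitted.]

The build downloads many files and takes roughly 1 to 2 hours; it may also fail partway through.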

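(2) Packaging Atlas

If HBase and Solr are already installed on the machine:

    mvn clean -DskipTests package -Pdist

If HBase and Solr are not installed, use the copies bundled with Atlas:

    mvn clean -DskipTests package -Pdist,embedded-hbase-solr

This article uses the second option.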

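(3) After packaging, the following packages are generated under the root directory:

[Listing of generated packages omitted.]

(4) Installing Atlas

    tar -xzvf apache-atlas-${project.version}-bin.tar.gz
    cd atlas-${project.version}

The build currently unpacks the archive automatically, so this step can be skipped.

Once everything has been downloaded, the directory structure is:

[Directory listing omitted.]

Under atlas_home/distro/target, apache-atlas-1.0.0-SNAPSHOT-bin is the unpacked directory.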

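Note: the configuration steps come next. Read the highlighted notes first, then continue with the text below.

To start Atlas with its default configuration only:

    cd /apache_atlas/atlas/distro/target/apache-atlas-1.0.0-SNAPSHOT-bin/apache-atlas-1.0.0-SNAPSHOT
    bin/atlas_start.py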

Test:

    curl -v http://localhost:21000/api/atlas/admin/version

This fails with:

    Error 401 Full authentication is required to access this resource
    HTTP ERROR 401
    Problem accessing /api/atlas/admin/version. Reason:
    Full authentication is required to access this resource

Cause: the request carries no credentials. The correct form is:

    curl -v -u username:password http://localhost:21000/api/atlas/admin/version

The default username is admin and the default password is admin:

    curl -v -u admin:admin http://localhost:21000/api/atlas/admin/version

This request succeeds.
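The same authenticated check can be scripted. The following is a minimal Python 3 sketch, assuming the default host, port, and admin/admin credentials quoted above; the function names build_version_request and fetch_version are illustrative and not part of Atlas.

```python
import base64
import urllib.request


def build_version_request(host="localhost", port=21000,
                          username="admin", password="admin"):
    """Build a GET request for the Atlas version endpoint with HTTP Basic auth."""
    url = "http://{}:{}/api/atlas/admin/version".format(host, port)
    # HTTP Basic auth is "username:password", base64-encoded, in the Authorization header.
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


def fetch_version(request, timeout=10):
    """Send the request and return the response body as text.

    Raises urllib.error.HTTPError (for example 401) if the server rejects it.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_version(build_version_request()) against a running server should return the same body as the curl command; without the Authorization header the server answers 401, as shown above.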

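With the startup above, Solr and HBase run in embedded mode. The embedded Solr listens on port 9838, unlike the default port 8983 of a standalone installation.

For a custom configuration, in particular using HBase as the storage backend for the graph repository and Solr as the indexing backend for the graph repository, read on.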

(5) Configuration

conf/atlas-env.sh

    # The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path
    #export JAVA_HOME=

    # any additional java opts you want to set. This will apply to both client and server operations
    #export ATLAS_OPTS=

    # any additional java opts that you want to set for client only
    #export ATLAS_CLIENT_OPTS=

    # java heap size we want to set for the client. Default is 1024MB
    #export ATLAS_CLIENT_HEAP=

    # any additional opts you want to set for atlas service.
    #export ATLAS_SERVER_OPTS=

    # java heap size we want to set for the atlas server. Default is 1024MB
    #export ATLAS_SERVER_HEAP=

    # What is considered as atlas home dir. Default is the base location of the installed software
    #export ATLAS_HOME_DIR=

    # Where log files are stored. Default is logs directory under the base install location
    #export ATLAS_LOG_DIR=

    # Where pid files are stored. Default is logs directory under the base install location
    #export ATLAS_PID_DIR=

    # Where do you want to expand the war file. By default it is in /server/webapp dir under the base install dir.
    #export ATLAS_EXPANDED_WEBAPP_DIR=

If JAVA_HOME is not set in /etc/profile, set it here.

Configure conf/atlas-application.properties:

    # Use HBase tables
    atlas.graph.storage.hbase.table=atlas
    atlas.audit.hbase.tablename=apache_atlas_entity_audit

This step requires a standalone Solr cluster, with ZooKeeper providing high availability for the cluster.

Reference:

https://cwiki.apache.org/confluence/display/solr/SolrCloud

Start the Solr cluster and create the collections:

    cd solr/bin
    ./solr create -c vertex_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
    ./solr create -c edge_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor
    ./solr create -c fulltext_index -d SOLR_CONF -shards #numShards -replicationFactor #replicationFactor

SOLR_CONF is the directory that contains solrconfig.xml, something the author was unclear about for a long time. In the author's setup it is:

    /usr/local/solr-5.5.1

If you do not know how many shards to create, leave numShards out; the default is 1.

The author's configuration:

    cd /apache_atlas/atlas/distro/target/solr/bin
    export SOLR_CONF=/usr/local/solr-5.5.1
    ./solr start -c -z localhost:2181 -p 8983
    ./solr create -c vertex_index -d $SOLR_CONF
    ./solr create -c edge_index -d $SOLR_CONF
    ./solr create -c fulltext_index -d $SOLR_CONF

Once the Solr cluster is running, set the following in atlas-application.properties:

    atlas.kafka.zookeeper.connect=localhost:2181
    atlas.graph.index.search.backend=solr5
    atlas.graph.index.search.solr.mode=cloud
    atlas.graph.index.search.solr.zookeeper-url=10.1.6.4:2181,10.1.6.5:2181
    atlas.graph.index.search.solr.zookeeper-connect-timeout=60000 ms
    atlas.graph.index.search.solr.zookeeper-session-timeout=60000 ms
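A typo in a host:port list is easy to miss in this file. As a rough sanity check, the settings can be read and validated with a few lines of Python. This is only a sketch: parse_properties and split_host_ports are hypothetical helpers written for this article, and the parser handles plain key=value lines only, not the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def split_host_ports(value):
    """Split 'h1:p1,h2:p2' into [(host, port), ...].

    Raises ValueError if an entry is not of the form host:port.
    """
    pairs = []
    for item in value.split(","):
        host, sep, port = item.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError("expected host:port, got {!r}".format(item))
        pairs.append((host, int(port)))
    return pairs
```

For the configuration above, split_host_ports(props["atlas.graph.index.search.solr.zookeeper-url"]) yields the two ZooKeeper endpoints as (host, port) tuples, and a malformed entry raises an error before Atlas is started.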

Start HBase:

    cd hbase/bin
    ./start-hbase.sh

Start Atlas:

    bin/atlas_start.py

Atlas UI:

    http://localhost:21000/

Error 1:

    java.io.FileNotFoundException: /apache_atlas/atlas/distro/target/server/webapp/atlas.war (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.(FileInputStream.java:138)
    at java.io.FileInputStream.(FileInputStream.java:93)
    at sun.tools.jar.Main.run(Main.java:307)
    at sun.tools.jar.Main.main(Main.java:1288)
    The Server is no longer running with pid 6353
    configured for local hbase.
    hbase started.
    configured for local solr.
    solr started.
    setting up solr collections...
    starting atlas on host localhost
    starting atlas on port 21000

Cause: Atlas was started from the wrong directory. The author found no solution for this online and only later traced it to the start path. The author had been starting Atlas from:

    /apache_atlas/atlas/distro/target/

The correct directory to start from is:

    /apache_atlas/atlas/distro/target/apache-atlas-1.0.0-SNAPSHOT-bin/apache-atlas-1.0.0-SNAPSHOT/
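Since the failure comes down to server/webapp/atlas.war not existing relative to the start directory, a quick pre-flight check can tell the two directories apart. The sketch below assumes only the relative path seen in the error message; find_atlas_home is a hypothetical helper, not an Atlas tool.

```python
import os


def find_atlas_home(candidates):
    """Return the first directory that contains server/webapp/atlas.war, else None."""
    for directory in candidates:
        war = os.path.join(directory, "server", "webapp", "atlas.war")
        if os.path.isfile(war):
            return directory
    return None
```

Passing both directories from this section as candidates should return the second one on a correctly built tree, which is the directory to cd into before running bin/atlas_start.py.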

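Error 2:

The error log under /apache_atlas/atlas/distro/target/logs contains:

    ERROR: Collection 'vertex_index' already exists!
    Checked collection existence using Collections API command:
    http://localhost:9838/solr/admin/collections?action=list

This is a collection name conflict.

Run:

    jps

and check whether several processes named jar are listed; these are Solr processes.

Hopefully others can avoid making the same mistake.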
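Before creating collections, it can help to ask Solr which ones already exist, using the same Collections API URL that appears in the log. The following Python sketch assumes the embedded Solr port 9838 from the log and adds wt=json to request a JSON response (an assumption on the author's part; the log URL does not include it). The helper names are illustrative.

```python
import json


def list_url(host="localhost", port=9838):
    """URL of the Collections API list command, asking for a JSON response."""
    return "http://{}:{}/solr/admin/collections?action=list&wt=json".format(host, port)


def collection_names(response_text):
    """Extract collection names from a JSON list response."""
    return json.loads(response_text).get("collections", [])


def missing_collections(existing,
                        wanted=("vertex_index", "edge_index", "fulltext_index")):
    """Return the Atlas collections that still need to be created."""
    return [name for name in wanted if name not in existing]
```

Only the names returned by missing_collections need a ./solr create call; creating one that already exists produces the "already exists" error above.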

Error 3:

    2018-05-05 11:10:18,545 WARN - [main:] ~ Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'services': Invocation of init method failed; nested exception is java.lang.RuntimeException: org.apache.atlas.AtlasException: Failed to start embedded kafka (AbstractApplicationContext:550)
    2018-05-05 11:10:18,699 ERROR - [main:] ~ Context initialization failed (ContextLoader:350)
    org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'services': Invocation of init method failed; nested exception is java.lang.RuntimeException: org.apache.atlas.AtlasException: Failed to start embedded kafka
    Caused by: org.apache.atlas.AtlasException: Failed to start embedded kafka
    at org.apache.atlas.kafka.EmbeddedKafkaServer.start(EmbeddedKafkaServer.java:83)
    at org.apache.atlas.service.Services.start(Services.java:67)
    ... 40 more
    Caused by: .BindException: Address already in use

Cause: a port is already in use.

Look at the atlas.kafka.zookeeper.connect=localhost: entry in conf/atlas-application.properties and check whether its port is already taken.
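One way to check whether a port is taken is to try connecting to it. The sketch below is a generic TCP probe in Python, not an Atlas utility; port_in_use is a hypothetical name, and the port to test is whatever value your own atlas-application.properties specifies.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

If this returns True before Atlas is started, another process is already listening on that port, and either that process or the configured port needs to change.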

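That covers installing Atlas; next, how to use it.

2. Using Apache Atlas

This section covers the REST API.

Points to note:

The Apache Atlas API mainly provides create, read, update, and delete operations on three building blocks: Type, Entity, and Attribute.

This may sound surprising. In practice much else is encapsulated or lives in configuration files, leaving the API and the Admin UI for external callers.

A brief introduction to these building blocks:

Type: a "type" in Atlas is a definition of how a particular kind of metadata object is stored and accessed. A type represents one attribute or a collection of attributes that define the properties of the metadata object. Readers with a development background will recognize the similarity to a "Class" definition in object-oriented programming languages, or a "table schema" in relational databases.

Entity: an "entity" in Atlas is a specific value or instance of a Class "type", and so represents a specific metadata object in the real world. In terms of the object-oriented analogy, an "instance" is an "Object" of a certain "Class".

Attribute: attributes are defined inside composite metatypes such as Class and Struct. Put simply, an attribute has a name and a metatype value. However, attributes in Atlas carry further properties that define additional concepts related to the type system.

These definitions are hard to digest, and the author found them off-putting too. So let us look at a few examples first.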

Example (1)

A Hive table defined as a Type, with a number of Attributes:

    Name: hive_table
    MetaType: Class
    SuperTypes: DataSet
    Attributes:
        name: String (name of the table)
        db: Database object of type hive_db
        owner: String
        createTime: Date
        lastAccessTime: Date
        comment: String
        retention: int
        sd: Storage Description object of type hive_storagedesc
        partitionKeys: Array of objects of type hive_column
        aliases: Array of strings
        columns: Array of objects of type hive_column
        parameters: Map of String keys to String values
        viewOriginalText: String
        viewExpandedText: String
        tableType: String
        temporary: Boolean

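This closely resembles a Java class definition, and also a JSON data definition.

Points to note:

1. A type in Atlas is uniquely identified by its "name".

2. Each type has a metatype, which indicates what kind of model the type is in Atlas.

3. Atlas has the following metatypes:

    Primitive metatypes: for example Int, String, Boolean.
    Enum metatypes: TODO
    Collection metatypes: for example Array, Map.
    Composite metatypes: for example Class, Struct, Trait.

4. A type can "extend" a parent type, called a "supertype". By doing so it also includes the attributes defined in the supertype. This lets modelers define common attributes across a set of related types. It is again similar to how object-oriented languages define superclasses for a class. In this example, every Hive table extends a predefined supertype called "DataSet". More details about this predefined type will be provided later. A type in Atlas can also extend multiple supertypes.

5. Types with the metatype "Class", "Struct", or "Trait" can have a collection of attributes. Each attribute has a name (for example "name") and some other associated properties. An attribute can be referred to using an expression.

From the description above, an Atlas type appears to share properties with a Java class, such as inheritance.

Thinking in terms of Java object relationships makes this easier to understand.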
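To make the inheritance idea concrete, here is a small Python model of types and supertypes. It is purely an illustration of the concept and not the Atlas API: TypeDef, all_attributes, and the registry are hypothetical, and placing the name attribute on DataSet is an assumption made only to show inheritance at work.

```python
from dataclasses import dataclass, field


@dataclass
class TypeDef:
    """Toy model of a type: a name, a metatype, supertypes, and attributes."""
    name: str
    metatype: str
    supertypes: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


def all_attributes(type_name, registry):
    """Collect a type's own attributes plus those inherited from its supertypes."""
    typedef = registry[type_name]
    merged = {}
    for parent in typedef.supertypes:
        merged.update(all_attributes(parent, registry))
    merged.update(typedef.attributes)  # the type's own attributes win on name clashes
    return merged


# Types are looked up by name, mirroring "a type is uniquely identified by its name".
registry = {
    "DataSet": TypeDef("DataSet", "Class", attributes={"name": "String"}),
    "hive_table": TypeDef("hive_table", "Class", ["DataSet"],
                          {"db": "hive_db", "owner": "String", "temporary": "Boolean"}),
}
```

Here all_attributes("hive_table", registry) includes name even though hive_table does not declare it in this toy registry, which is the effect of extending a supertype.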

Example (2)

The definition of an Entity:

    id: "9ba387ddfa76429cb791ffc338d3c91f"
    typeName: "hive_table"
    values:
        name: "customers"
        db: "b42c6cfcc1e742fda9e6890e0adf33bc"
        owner: "admin"
        createTime: "20160620T06:13:28.000Z"
        lastAccessTime: "20160620T06:13:28.000Z"
        comment: null
        retention:
        sd: "ff58025f685441959f753a3058dd8dcf"
        partitionKeys: null
        aliases: null
        columns: ["65e2204f6a234130934a9679af6a211f", "d726de70faca46fb9c99cf04f6b579a6", ...]
        parameters: {"transient_lastDdlTime": "1466403208"}
        viewOriginalText: null
        viewExpandedText: null
        tableType: "MANAGED_TABLE"
        temporary: false

The id above is the Entity's id.

Following the Java-object analogy, the structure of an Entity is also fairly easy to understand.
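Note that attributes such as db, sd, and columns hold the ids of other entities rather than nested objects. The Python sketch below shows how such a reference could be followed in an in-memory store. It only illustrates the data shape: the store, the resolve helper, and the hive_db entry with its name value are hypothetical, while the two ids are taken from the example above.

```python
# In-memory stand-in for an entity store, keyed by entity id.
entities = {
    "9ba387ddfa76429cb791ffc338d3c91f": {
        "typeName": "hive_table",
        "values": {
            "name": "customers",
            "db": "b42c6cfcc1e742fda9e6890e0adf33bc",  # id of another entity
            "owner": "admin",
        },
    },
    "b42c6cfcc1e742fda9e6890e0adf33bc": {
        "typeName": "hive_db",
        "values": {"name": "default"},  # hypothetical value, for illustration only
    },
}


def resolve(entity_id, attribute, store):
    """Follow an attribute that holds another entity's id.

    Returns the referenced entity, or None if the value is not a known id.
    """
    reference = store[entity_id]["values"][attribute]
    return store.get(reference)
```

Resolving the db attribute of the customers table returns the hive_db entity, much as a Java field would hold a reference to another object.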

Example (3)

Common APIs:

(1) Get all types

    GET http://atlasserverhost:port/api/atlas/types
    GET http://atlasserverhost:port/api/atlas/types?type=STRUCT|CLASS|TRAIT

Response:

    {
      "results": [
        "Asset",
        "hive_column",
        "Process",
        "storm_node",
        "storm_bolt",
        "falc

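The two request forms for listing types differ only in the optional type filter. A minimal Python sketch for building the URL and reading the results list follows; types_url and type_names are illustrative helpers, and the host and port are placeholders for your own server.

```python
import json
from urllib.parse import urlencode

ALLOWED_METATYPES = ("STRUCT", "CLASS", "TRAIT")


def types_url(host, port, metatype=None):
    """Build the URL for listing types, optionally filtered by one metatype."""
    base = "http://{}:{}/api/atlas/types".format(host, port)
    if metatype is None:
        return base
    if metatype not in ALLOWED_METATYPES:
        raise ValueError("type must be one of {}".format(", ".join(ALLOWED_METATYPES)))
    return base + "?" + urlencode({"type": metatype})


def type_names(response_text):
    """Return the list under the "results" key of the JSON response."""
    return json.loads(response_text).get("results", [])
```

The request itself still needs the admin credentials described in the installation section; without them the server answers 401.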