Connecting to Hive remotely with Python: how to connect to a Hive database from Python on Windows

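‘1’ Does connecting to Hive from Python require the sasl library?

Clients connect to Hive through HiveServer2. HiveServer2 is a rewrite of HiveServer, which could not handle concurrent requests from multiple clients. HiveServer2 is built on Thrift RPC and is designed to give better support to open API clients such as JDBC and ODBC. It was introduced in Hive 0.11.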

HiveServer2 startup

Starting HiveServer2

Starting HiveServer2 is straightforward:

$ $HIVE_HOME/bin/hiveserver2

or

$ $HIVE_HOME/bin/hive --service hiveserver2

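By default, HiveServer2's Thrift service listens on port 10000 and its Web UI on port 10002. Opening port 10002 of the HiveServer2 host in a browser shows the Web UI, which displays some basic information about Hive. If the Web UI cannot be opened, HiveServer2 did not start successfully.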
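As a rough way to check the two default ports from a script, the sketch below tries a plain TCP connection to each. The function name port_is_open and the timeout value are illustrative assumptions, not part of Hive. An open port only shows that something is listening; it does not prove that queries will run.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection completes the TCP handshake or raises OSError
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts
        return False
```

For a HiveServer2 on the local machine with default settings, one would expect port_is_open("localhost", 10000) and port_is_open("localhost", 10002) to both be true once startup has finished.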

Testing the client connection with beeline

Once HiveServer2 is running, we can connect to it with beeline, the client tool that ships with Hive.

$ $HIVE_HOME/bin/beeline

beeline> !connect jdbc:hive2://localhost:10000

After a successful login the following prompt appears, and you can start writing HQL statements.

0: jdbc:hive2://localhost:10000>

Error: User: xxx is not allowed to impersonate anonymous

When connecting to HiveServer2 with !connect in beeline, the following error may appear:

Caused by: org.apache.hadoop.ipc.RemoteException: User: xxx is not allowed to impersonate anonymous

Here xxx is my operating system user name. The fix is to add proxy-user settings for xxx to Hadoop's core-site.xml:

<property>
    <name>hadoop.proxyuser.xxx.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.xxx.hosts</name>
    <value>*</value>
</property>

After restarting HDFS, beeline can connect to HiveServer2 successfully.
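If the proxy-user entries have to be produced for several user names, they can be generated rather than typed by hand. The following is a minimal sketch using only the standard library; the helper name proxyuser_properties is an assumption made for illustration, and the wildcard defaults simply mirror the values shown above.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(user, groups="*", hosts="*"):
    """Build the two <property> elements that let `user` act as a proxy user."""
    elements = []
    for suffix, value in (("groups", groups), ("hosts", hosts)):
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = f"hadoop.proxyuser.{user}.{suffix}"
        ET.SubElement(prop, "value").text = value
        elements.append(prop)
    return elements


def render(elements):
    """Serialize the elements to text that can be pasted into core-site.xml."""
    return "\n".join(ET.tostring(e, encoding="unicode") for e in elements)
```

Calling render(proxyuser_properties("xxx")) yields the two property entries on one line each; narrowing groups or hosts from "*" to specific values is usually preferable outside a test setup.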

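Common configuration

For HiveServer2 configuration, see the official document "Setting Up HiveServer2".

Some commonly used hive-site.xml settings:

hive.server2.thrift.port: the TCP port to listen on. Defaults to 10000.

hive.server2.thrift.bind.host: the host that the TCP interface binds to.

hive.server2.authentication: the authentication mode. Defaults to NONE (which uses plain SASL), meaning no authentication check is performed. Other options are NOSASL, KERBEROS, LDAP, PAM and CUSTOM.

hive.server2.enable.doAs: whether queries are executed as the connecting (impersonated) user. Defaults to true.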
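To see which of these values a given server is actually using, the settings can be read out of hive-site.xml. The sketch below is an illustration only: read_hiveserver2_settings and the SAMPLE text are made up for the example, and the fallback values are the defaults quoted above.

```python
import xml.etree.ElementTree as ET

# Defaults as listed in the text; bind.host has no stated default, so it is left out
DEFAULTS = {
    "hive.server2.thrift.port": "10000",
    "hive.server2.authentication": "NONE",
    "hive.server2.enable.doAs": "true",
}

# Hypothetical hive-site.xml content used only to demonstrate the function
SAMPLE = """<configuration>
  <property><name>hive.server2.thrift.port</name><value>10001</value></property>
  <property><name>hive.server2.authentication</name><value>NOSASL</value></property>
</configuration>"""


def read_hiveserver2_settings(xml_text):
    """Return hive.server2.* settings from hive-site.xml text, falling back to DEFAULTS."""
    settings = dict(DEFAULTS)
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.startswith("hive.server2."):
            settings[name] = value
    return settings
```

With SAMPLE as input, the port and authentication mode come from the file while hive.server2.enable.doAs keeps its default. Knowing the authentication mode matters because the client has to use a matching mechanism.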

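Connecting to HiveServer2 from a Python client

Three Python clients can connect to HiveServer2: pyhs2, pyhive and impyla. The official example uses pyhs2, but the pyhs2 project has announced that it is no longer supported and recommends impyla or pyhive instead. We use impyla here.

Installing impyla

impyla requires the following packages:

six
bitarray
thriftpy (thrift on Python 2.x)

Hive support additionally needs these two packages:

sasl
thrift_sasl

The source code of impyla and its dependencies can be downloaded from PyPI.

impyla example

The following is an example of connecting to HiveServer2 with impyla: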
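The original example did not survive in this copy, so what follows is a minimal sketch rather than the author's code. It assumes impyla is installed, that HiveServer2 runs with the default authentication mode NONE (for which impyla's auth_mechanism="PLAIN" is the usual match), and that the host, port and database are the defaults mentioned above. The helper names connect_hiveserver2 and run_query are illustrative.

```python
def connect_hiveserver2(host="localhost", port=10000, user=None, database="default"):
    """Open a DB-API connection to HiveServer2 through impyla."""
    # Imported lazily so run_query below can be used without impyla installed
    from impala.dbapi import connect

    # PLAIN is assumed to match hive.server2.authentication=NONE
    return connect(host=host, port=port, user=user, database=database,
                   auth_mechanism="PLAIN")


def run_query(conn, sql):
    """Execute sql on any DB-API 2.0 connection and return all rows as a list."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        cursor.close()
```

A typical call would be run_query(connect_hiveserver2(host="..."), "SHOW TABLES") with the real server address filled in, followed by closing the connection. Keeping run_query independent of impyla means the same helper works with any DB-API connection, which also makes it easy to try out locally.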

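‘2’ How to connect to a Hive database from Python on Windows

The way to connect to Hive from Python depends on the Hive version.
Searching the web for "python hive" turns up several solutions. Most go like this: copy $HIVE_HOME/lib/py from the Hive installation into Python's library directory (site-packages), or simply put your new Python code in the same directory as the copied py library, and then call the Thrift interface it provides. The examples are very simple, for instance: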
import sys
from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol


def hiveExe(sql):
    """Run one statement through the old HiveServer Thrift interface and print the result."""
    try:
        transport = TSocket.TSocket('127.0.0.1', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()

        client.execute(sql)

        print("The return value is : ")
        print(client.fetchAll())
        print("............")
        transport.close()
    except Thrift.TException as tx:
        print('%s' % (tx.message))


if __name__ == '__main__':
    hiveExe("show tables")

Or like this:
#!/usr/bin/env python

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('14.18.154.188', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = ThriftHive.Client(protocol)
    transport.open()

    client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
    client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")
    client.execute("SELECT * FROM test1")
    # Read the result one row at a time until nothing is left
    while True:
        row = client.fetchOne()
        if row is None:
            break
        print(row)
    # Run the query again and fetch everything in one call
    client.execute("SELECT * FROM test1")
    print(client.fetchAll())

    transport.close()

except Thrift.TException as tx:
    print('%s' % (tx.message))

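Neither of these solved the problem. netstat shows that the TCP connection is indeed established, but the Hive statements are never executed. Perhaps it is a version issue.
As I keep saying, reading the official documentation beats reading assorted Chinese blogs.
The project uses Hive 0.13, while the latest release on the official site is already 1.2.1, with 1.2.0, 1.1.0, 1.0.0 and 0.14.0 in between. Still, let's try the method from the official site.
First, look at the official "Setting Up HiveServer2" page.
It shows that when starting hiveserver2 you can configure the minimum and maximum number of worker threads, the bind IP and the bind port, and you can also set the authentication mode (which is exactly why the earlier attempts kept failing). It also gives Python sample code: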
import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute("select * from table")

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)

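When I first got this code I presumptuously removed the authentication arguments. Running it then behaved just like the methods from the blogs: the TCP connection was established, but nothing was executed and no error was reported. What was going on? Then, by chance, I tried the code above exactly as written, and it worked.
To be clear, I had not changed any of the default hiveserver2 settings in hive-site.xml; hiveserver2 was always started with the default configuration. What I did not expect is that the default configuration still has an authentication mechanism.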
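The pitfall described here is that the client-side mechanism has to match the server's hive.server2.authentication value, even when that value is NONE. As a hedged illustration, the table below maps server settings to the auth_mechanism names that impyla accepts; the mapping reflects my understanding of impyla rather than anything stated in this post, pyhs2 uses its own names for the authMechanism argument, and PAM and CUSTOM are deliberately left out because their client-side counterpart depends on the deployment.

```python
# Assumed mapping from hive.server2.authentication to impyla's auth_mechanism
SERVER_TO_CLIENT_AUTH = {
    "NONE": "PLAIN",      # the default server mode still performs a plain SASL handshake
    "NOSASL": "NOSASL",   # raw Thrift transport without SASL
    "KERBEROS": "GSSAPI",
    "LDAP": "LDAP",
}


def client_auth_mechanism(server_setting):
    """Translate a hive.server2.authentication value into a client-side mechanism."""
    key = server_setting.strip().upper()
    if key not in SERVER_TO_CLIENT_AUTH:
        raise ValueError(f"no known client mechanism for server setting {server_setting!r}")
    return SERVER_TO_CLIENT_AUTH[key]
```

A mismatch between the two sides typically shows up exactly as described above: the TCP connection is established, but the handshake never completes and nothing is executed.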
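One more note: I ran into some trouble installing pyhs2. Again, the official documentation would have helped; the problem arose only because I installed with pip without reading it. Installing pyhs2 requires several dependencies to be present already. See the wiki on GitHub and install whichever one is missing.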
To install pyhs2 on a clean CentOS 6.4 64-bit desktop....

(as root or with sudo)

get ez_setup.py from https://pypi.python.org/pypi/ez_setup
python ez_setup.py
easy_install pip
yum install gcc-c++
yum install cyrus-sasl-devel.x86_64
yum install python-devel.x86_64
pip install pyhs2

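That was a long ramble about the problems I ran into. Now for how to actually connect to Hive from Python.
Connecting to Hive from Python is done over Thrift, so the server and the client have to cooperate.
On the server, the hiveserver2 service must be running. There are two ways to start it; the second is just a wrapper around the first.
1. $HIVE_HOME/bin/hive --service hiveserver2
2. $HIVE_HOME/bin/hiveserver2

By default hiveserver2 listens on port 10000. The defaults can be changed by editing hive-site.xml or by passing parameters at startup.
On the client, the Python package pyhs2 must be installed. As described above, this is basically pip install pyhs2; if the installation fails, install the dependencies listed above.
Finally, run the sample code above after setting the IP address, port, database and table name. By default the authentication settings do not need to be changed.
One more point: the fetch function is fairly slow because it returns the entire result set. Look through the pyhs2 source to see which other functions are available. (The original post included a figure listing the functions of the Cursor class.)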

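Hive tables usually hold a lot of data, so it is better to read rows one at a time with the fetchone function. fetchone returns a list when a row is read successfully and None otherwise. The sample code can be changed by replacing the fetch loop with: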
count = 0
while True:
    row = cur.fetchone()
    if row is not None:
        count += 1
        print('%d %s' % (count, row))
    else:
        print("it's over")
        break  # stop once fetchone reports that no rows remain
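The row-by-row loop can also be packaged as a generator, so that the calling code is an ordinary for loop. This is a sketch under the assumption that the cursor follows the Python DB-API 2.0 and offers fetchmany, as the cursors of PyHive and impyla do; whether pyhs2's cursor provides the same method is not confirmed by the text. The name iter_rows and the default batch size are illustrative.

```python
def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching them from the cursor in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return  # an empty batch signals the end of the result set
        for row in batch:
            yield row
```

Usage would look like for row in iter_rows(cur): process(row), where process stands for whatever handling each row needs. Fetching in batches keeps memory use bounded like fetchone does, while making fewer round trips to the server.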

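‘3’ How to connect to a Hive database from Python on Windows

MySQLdb.connect is the function for connecting to a MySQL database from Python; it is available after import MySQLdb. Its parameters are simple: host is the MySQL server name, user is the database user, password is that user's login password, db is the database to operate on, and charset is the character set to use (usually gb2312).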

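‘4’ Problem exporting to Excel after connecting to Hive from Python

The error is caused by null values in your raw data. Fill in the missing values before writing or reading, or check whether each value is null before operating on it.
Alternatively, wrap the operation in try/except to skip the failing records.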
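The first suggestion, filling missing values before writing, might look like the sketch below. It uses only the standard library and writes CSV, which Excel can open; the question itself does not say which export library was used, so the function names, the default fill value and the CSV target are all assumptions for illustration.

```python
import csv
import io


def fill_missing(rows, default=""):
    """Return rows as lists with every None replaced by default."""
    return [[default if cell is None else cell for cell in row] for row in rows]


def rows_to_csv(rows, header, default=""):
    """Serialize cleaned rows to CSV text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(fill_missing(rows, default))
    return buffer.getvalue()
```

Here rows would be the result of a fetch from the Hive cursor. Choosing an explicit default such as 0 or "N/A" makes the substitution visible in the exported file instead of leaving silent gaps.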

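‘5’ Connecting to Hive from Python (Linux)

I chose to connect from a Linux system because on Windows the Hadoop authentication fails. You may see an error saying that the machine running the Python script has no Kerberos credentials for the target Hive, and you may also hit problems calling sasl:

I tried many times and could not resolve this error (if you know a fix on Windows, please leave a comment), so I recommend using Linux.

VMware Workstation + Ubuntu

There are many tutorials online; this article recommends one: https://blog.csdn.net/stpeace/article/details/78598333

Mainly the following four packages are needed:

Installing the sasl package can be troublesome, mainly because sasl.h is missing on Ubuntu. This can be solved with the following command:

This differs somewhat from CentOS.

This article connects from a virtual machine on my own computer to the company's test-environment Hive using Python (production and test are isolated, and production can only be reached through a bastion host).

Since I lack an engineering and computer-science background, my understanding of many points is shallow, and corrections from experts are welcome. Finally, thanks to the authors of the following two posts: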
[1] https://www.hu.com/question/269333988/answer/581126392
[2] https://mp.weixin.qq.com/s/cdFxkphMtJASQ7-nKt13mg

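‘6’ A question about connecting to Hive remotely from Python via Thrift

Are you sure the Thrift service you started is actually up? First check on the server that the port is open and that it is bound to an IP other than localhost. If both are true, a remote connection should work.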

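‘7’ Which port does pyhs2 use to access Hive from Python

2. JDBC. There are of course other connection methods such as ODBC. This approach is very common and easy to find online, so I will not go into detail. It is unstable and is often brought down by large data volumes, so it is not recommended. 3. Connecting directly through Hive's Driver class. Since this bypasses JDBC it should be somewhat faster (not verified); I have only tested it in local mode.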

‘8’ How to connect to a Hive database from Python on Windows

#!/usr/bin/python2.7
# hive --service hiveserver >/dev/null 2>/dev/null &
# /opt/cloudera/parcels/CDH/lib/hive/lib/py
import sys

# Talk to hiveserver from Python
sys.path.append('C:/hadoop_jar/py')
from hive_service import ThriftHive
# from hive_service.   (this import is truncated in the source and unused below)
from thrift.transport import TSocket
from thrift import Thrift
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

if __name__ == '__main__':
    try:
        socket = TSocket.TSocket('10.70.50.111', 10000)
        transport = TTransport.TBufferedTransport(socket)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        sql = 'select * from test'
        transport.open()
        client.execute(sql)
        with open('C:/Users/DWJ/Desktop/python2hive.txt', 'w') as out_file:
            # Fetch each row exactly once; calling fetchOne() in both the loop
            # test and the write would skip every other row
            row = client.fetchOne()
            while row:
                out_file.write(row + '\n')
                row = client.fetchOne()
        transport.close()
    except Thrift.TException as tx:
        print('%s' % (tx.message))

The packages in C:/hadoop_jar/py come from the py directory shipped with the Hive installation, for example /opt/cloudera/parcels/CDH/lib/hive/lib/py; add that directory to Python's path.

‘9’ Several ways to connect to Hive

Besides running the hive command directly, a client can connect to Hive with beeline. The following forms are commonly used (wind is the user name in each case):

1. Direct beeline connection:

beeline -u jdbc:hive2://192.168.188.100:10000 -n wind

2. Parameterized beeline connection:

hiveserver2_url="jdbc:hive2://192.168.188.100:10000 -n wind"

beeline -u ${hiveserver2_url} -f /home/hadoop/app/shell/hive/ --hivevar v_data=value

3. High-availability beeline connection:

beeline -u "jdbc:hive2://192.168.188.100:2181,192.168.188.101:2181,192.168.188.102:2181,192.168.188.103:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n wind

4. High-availability beeline connection with authentication:

beeline -u "jdbc:hive2://dn02.hadoop.cn:2181,dn01.hadoop.cn:2181,dn03.hadoop.cn:2181/devportaldemo;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;auth=kerberos;principal=hive/[email protected]?mapreduce.job.queuename=0122a8ed-08e0-4945-acb7-d04f910b196c"
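The high-availability URLs follow a regular pattern: a comma-separated ZooKeeper quorum, an optional database, and the service-discovery parameters. A small helper can assemble them so that the parameter names are not mistyped. This is a sketch only; zookeeper_jdbc_url is a made-up name, and it covers just the unauthenticated form shown in item 3.

```python
def zookeeper_jdbc_url(quorum, namespace="hiveserver2", database=""):
    """Build a HiveServer2 JDBC URL that uses ZooKeeper service discovery.

    quorum is a list of (host, port) pairs for the ZooKeeper servers.
    """
    hosts = ",".join(f"{host}:{port}" for host, port in quorum)
    return (f"jdbc:hive2://{hosts}/{database}"
            f";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace={namespace}")
```

The resulting string can be passed to beeline after -u (in quotes, because of the semicolons) or to any JDBC-based client.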

‘10’ Connecting to Hive from Python: how to install thrifthive