captcha源碼_如何利用Python做簡單的驗證碼識別

① 怎樣用python設計一個爬蟲模擬登陸知乎

給你一個例子，可以看看：

import requests
import time
import json
import os
import re
import sys
import subprocess
from bs4 import BeautifulSoup as BS

class ZhiHuClient(object):

"""連接知乎的工具類，維護一個Session
2015.11.11

用法：

client = ZhiHuClient()

# 第一次使用時需要調用此方法登錄一次，生成cookie文件
# 以後可以跳過這一步
client.login("username", "password")

# 用這個session進行其他網路操作，詳見requests庫
session = client.getSession()
"""

# 網址參數是賬號類型
TYPE_PHONE_NUM = "phone_num"
TYPE_EMAIL = "email"
loginURL = r"http://www.hu.com/login/{0}"
homeURL = r"http://www.hu.com"
captchaURL = r"http://www.hu.com/captcha.gif"

headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate",
"Host": "www.hu.com",
"Upgrade-Insecure-Requests": "1",
}

captchaFile = os.path.join(sys.path[0], "captcha.gif")
cookieFile = os.path.join(sys.path[0], "cookie")

def __init__(self):
os.chdir(sys.path[0]) # 設置腳本所在目錄為當前工作目錄

self.__session = requests.Session()
self.__session.headers = self.headers # 用self調用類變數是防止將來類改名
# 若已經有 cookie 則直接登錄
self.__cookie = self.__loadCookie()
if self.__cookie:
print("檢測到cookie文件，直接使用cookie登錄")
self.__session.cookies.update(self.__cookie)
soup = BS(self.open(r"http://www.hu.com/").text, "html.parser")
print("已登陸賬號： %s" % soup.find("span", class_="name").getText())
else:
print("沒有找到cookie文件，請調用login方法登錄一次！")

# 登錄
def login(self, username, password):
"""
驗證碼錯誤返回：
{'errcode': 1991829, 'r': 1, 'data': {'captcha': '請提交正確的驗證碼 :('}, 'msg': '請提交正確的驗證碼 :('}
登錄成功返回：
{'r': 0, 'msg': '登陸成功'}
"""
self.__username = username
self.__password = password
self.__loginURL = self.loginURL.format(self.__getUsernameType())
# 隨便開個網頁，獲取登陸所需的_xsrf
html = self.open(self.homeURL).text
soup = BS(html, "html.parser")
_xsrf = soup.find("input", {"name": "_xsrf"})["value"]
# 下載驗證碼圖片
while True:
captcha = self.open(self.captchaURL).content
with open(self.captchaFile, "wb") as output:
output.write(captcha)
# 人眼識別
print("=" * 50)
print("已打開驗證碼圖片，請識別！")
subprocess.call(self.captchaFile, shell=True)
captcha = input("請輸入驗證碼：")
os.remove(self.captchaFile)
# 發送POST請求
data = {
"_xsrf": _xsrf,
"password": self.__password,
"remember_me": "true",
self.__getUsernameType(): self.__username,
"captcha": captcha
}
res = self.__session.post(self.__loginURL, data=data)
print("=" * 50)
# print(res.text) # 輸出腳本信息，調試用
if res.json()["r"] == 0:
print("登錄成功")
self.__saveCookie()
break
else:
print("登錄失敗")
print("錯誤信息 --->", res.json()["msg"])

def __getUsernameType(self):
"""判斷用戶名類型
經測試，網頁的判斷規則是純數字為phone_num，其他為email
"""
if self.__username.isdigit():
return self.TYPE_PHONE_NUM
return self.TYPE_EMAIL

def __saveCookie(self):
"""cookies 序列化到文件
即把dict對象轉化成字元串保存
"""
with open(self.cookieFile, "w") as output:
cookies = self.__session.cookies.get_dict()
json.mp(cookies, output)
print("=" * 50)
print("已在同目錄下生成cookie文件：", self.cookieFile)

def __loadCookie(self):
"""讀取cookie文件，返回反序列化後的dict對象，沒有則返回None"""
if os.path.exists(self.cookieFile):
print("=" * 50)
with open(self.cookieFile, "r") as f:
cookie = json.load(f)
return cookie
return None

def open(self, url, delay=0, timeout=10):
"""打開網頁，返回Response對象"""
if delay:
time.sleep(delay)
return self.__session.get(url, timeout=timeout)

def getSession(self):
return self.__session

if __name__ == '__main__':
client = ZhiHuClient()

# 第一次使用時需要調用此方法登錄一次，生成cookie文件
# 以後可以跳過這一步
# client.login("username", "password")

# 用這個session進行其他網路操作，詳見requests庫
session = client.getSession()

② 如何利用Python做簡單的驗證碼識別

1摘要

驗證碼是目前互聯網上非常常見也是非常重要的一個事物，充當著很多系統的防火牆功能，但是隨時OCR技術的發展，驗證碼暴露出來的安全問題也越來越嚴峻。本文介紹了一套字元驗證碼識別的完整流程，對於驗證碼安全和OCR識別技術都有一定的借鑒意義。

然後經過了一年的時間，筆者又研究和get到了一種更強大的基於CNN卷積神經網路的直接端到端的驗證識別技術（文章不是我的，然後我把源碼整理了下，介紹和源碼在這裡面）：

基於python語言的tensorflow的『端到端』的字元型驗證碼識別源碼整理(github源碼分享)

2關鍵詞

關鍵詞：安全,字元圖片,驗證碼識別,OCR,Python,SVM,PIL

3免責聲明

本文研究所用素材來自於某舊Web框架的網站完全對外公開的公共圖片資源。

本文只做了該網站對外公開的公共圖片資源進行了爬取，並未越權做任何多餘操作。

本文在書寫相關報告的時候已經隱去漏洞網站的身份信息。

本文作者已經通知網站相關人員此系統漏洞，並積極向新系統轉移。

本報告的主要目的也僅是用於OCR交流學習和引起大家對驗證安全的警覺。

4引言

關於驗證碼的非技術部分的介紹，可以參考以前寫的一篇科普類的文章：

互聯網安全防火牆（1）--網路驗證碼的科普

裡面對驗證碼的種類，使用場景，作用，主要的識別技術等等進行了講解，然而並沒有涉及到任何技術內容。本章內容則作為它的技術補充來給出相應的識別的解決方案，讓讀者對驗證碼的功能及安全性問題有更深刻的認識。

5基本工具

要達到本文的目的，只需要簡單的編程知識即可，因為現在的機器學習領域的蓬勃發展，已經有很多封裝好的開源解決方案來進行機器學習。普通程序員已經不需要了解復雜的數學原理，即可以實現對這些工具的應用了。

主要開發環境：

python3.5
python SDK版本
PIL
圖片處理庫
libsvm
開源的svm機器學習庫

關於環境的安裝，不是本文的重點，故略去。

6基本流程

一般情況下，對於字元型驗證碼的識別流程如下：

准備原始圖片素材
圖片預處理
圖片字元切割
圖片尺寸歸一化
圖片字元標記
字元圖片特徵提取
生成特徵和標記對應的訓練數據集
訓練特徵標記數據生成識別模型
使用識別模型預測新的未知圖片集
達到根據「圖片」就能返回識別正確的字元集的目標

7素材准備

7.1素材選擇

由於本文是以初級的學習研究目的為主，要求「有代表性，但又不會太難」，所以就直接在網上找個比較有代表性的簡單的字元型驗證碼（感覺像在找漏洞一樣）。

最後在一個比較舊的網站（估計是幾十年前的網站框架）找到了這個驗證碼圖片。

原始圖：

def get_feature(img): """

獲取指定圖片的特徵值,

1. 按照每排的像素點,高度為10,則有10個維度,然後為6列,總共16個維度

:param img_path:

:return:一個維度為10（高度）的列表 """

width, height = img.size

pixel_cnt_list = []

height = 10 for y in range(height):

pix_cnt_x = 0 for x in range(width): if img.getpixel((x, y)) == 0: # 黑色點

pix_cnt_x += 1

pixel_cnt_list.append(pix_cnt_x) for x in range(width):

pix_cnt_y = 0 for y in range(height): if img.getpixel((x, y)) == 0: # 黑色點

pix_cnt_y += 1

pixel_cnt_list.append(pix_cnt_y) return pixel_cnt_list

然後就將圖片素材特徵化，按照libSVM指定的格式生成一組帶特徵值和標記值的向量文

導航:首頁 > 源碼編譯 > captcha源碼

captcha源碼

與captcha源碼相關的資料