Commit 41dc969c authored by 段英荣

Initial commit

MIT License
Copyright (c) 2019 zkqiang
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
![pic](https://github.com/zkqiang/Zhihu-Login/blob/master/docs/0.jpg)
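# Simulating Zhihu Login with Python (2019 update), with Captcha Support and Cookie Persistence
> Zhihu's login page has been redesigned several times and its identity checks tightened, so most simulated-login scripts found online no longer work. I therefore rewrote a complete one that also submits the captcha (including the Chinese captcha). This article walks through the analysis and the code step by step; the full code is in the GitHub repository linked at the end. I still recommend reading the article itself, because the code will stop working sooner or later, while the approach to analysing the site stays useful.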
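## Analysing the POST request
First open the browser console and log in normally once. The login API endpoint is easy to spot; this is the URL the simulated login has to POST to.
<img src="https://github.com/zkqiang/Zhihu-Login/blob/master/docs/1.jpg" width=600 align=center alt="Remember to tick Preserve log at the top before you start">
The end goal is simply to build the two objects the POST request needs: Headers and Form-Data.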
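## Building the Headers
Next, look at the `Request Headers`. Compared with the GET request for the login page, this POST carries three extra authentication fields; testing shows that `x-xsrftoken` is required.
`x-xsrftoken` is the token used for XSRF (cross-site request forgery) protection. It can be found in the `Set-Cookie` field of the `Response Headers` when visiting the home page.
<img src="https://github.com/zkqiang/Zhihu-Login/blob/master/docs/2.jpg" width=600 align=center alt="Only visible on a request sent without Cookies">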
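As a rough illustration of the step above, the `_xsrf` value can be pulled out of a single `Set-Cookie` header with the standard library. The helper name `xsrf_from_set_cookie` and the sample header value are made up for this sketch; the repository's own code reads the cookie from the `requests` session instead.

```python
from http.cookies import SimpleCookie
from typing import Optional


def xsrf_from_set_cookie(header_value: str) -> Optional[str]:
    """Return the `_xsrf` cookie value from one Set-Cookie header, if present."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    morsel = cookie.get('_xsrf')
    return morsel.value if morsel is not None else None


# Hypothetical header value, not a real token.
sample = '_xsrf=abc123; path=/; domain=zhihu.com'
print(xsrf_from_set_cookie(sample))
```

The returned string is what would be sent back in the `x-xsrftoken` request header.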
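## Building the Form-Data
The form body is now encrypted and can no longer be read directly. The plaintext can be seen by setting a breakpoint in the JS (not covered here; search for a guide if you have not set breakpoints before).
<img src="https://github.com/zkqiang/Zhihu-Login/blob/master/docs/6.jpg" width=600 align=center alt="Where to set the breakpoint">
<img src="https://github.com/zkqiang/Zhihu-Login/blob/master/docs/7.jpg" width=600 align=center alt="Request Payload details">
Now we build the parameters shown above one by one:
`timestamp`: the timestamp. This one is easy; the only difference is that Zhihu uses a 13-digit integer (milliseconds), whereas the integer part of Python's `time.time()` has only 10 digits (seconds), so it must be multiplied by 1000.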
```
timestamp = str(int(time.time()*1000))
```
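`signature`: a global search (Ctrl+Shift+F) shows that it is generated in one of the JS files. It is an HMAC (SHA-1) digest computed over several fixed values plus the timestamp, so all we need to do is reproduce the same computation in Python.
<img src="https://github.com/zkqiang/Zhihu-Login/blob/master/docs/3.jpg" width=600 align=center alt="Python's built-in hmac module makes this easy">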
```python
def _get_signature(self, timestamp):
    # The key and the fixed fields are the values observed in Zhihu's JS
    ha = hmac.new(b'd1b964811afb40118a12068ff74a12f4', digestmod=hashlib.sha1)
    grant_type = self.login_data['grant_type']
    client_id = self.login_data['client_id']
    source = self.login_data['source']
    # str() lets the method accept the timestamp as either int or str
    ha.update(bytes((grant_type + client_id + source + str(timestamp)), 'utf-8'))
    return ha.hexdigest()
```
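The method above depends on `self.login_data`, so here is a standalone sketch of the same HMAC-SHA1 computation that can be run on its own. The function name `make_signature` is hypothetical; the key and the fixed field values are copied from this repository's code and may well have been changed by Zhihu since.

```python
import hashlib
import hmac
import time

# Values copied from the code in this repository; they may no longer be valid.
HMAC_KEY = b'd1b964811afb40118a12068ff74a12f4'
CLIENT_ID = 'c3cef7c66a1843f8b3a9e6a1e3160e20'


def make_signature(timestamp, grant_type='password',
                   client_id=CLIENT_ID, source='com.zhihu.web'):
    """HMAC-SHA1 over grant_type + client_id + source + timestamp, as hex."""
    message = grant_type + client_id + source + str(timestamp)
    return hmac.new(HMAC_KEY, message.encode('utf-8'), hashlib.sha1).hexdigest()


timestamp = int(time.time() * 1000)  # 13-digit millisecond timestamp
print(timestamp, make_signature(timestamp))
```

The same timestamp value has to be sent in the form alongside the signature, otherwise the server cannot reproduce the digest.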
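`captcha`: the captcha. A GET request to a separate API endpoint reports whether a captcha is required (this request must be made once whether or not one is needed). If the answer is true, a further PUT request returns the image as a base64 string.
<img src="https://github.com/zkqiang/Zhihu-Login/blob/master/docs/4.jpg" width=600 align=center alt="Decode the base64 and write it to an image file">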
```python
resp = self.session.get(api, headers=headers)
show_captcha = re.search(r'true', resp.text)
if show_captcha:
    put_resp = self.session.put(api, headers=headers)
    json_data = json.loads(put_resp.text)
    img_base64 = json_data['img_base64'].replace(r'\n', '')
    with open('./captcha.jpg', 'wb') as f:
        f.write(base64.b64decode(img_base64))
    img = Image.open('./captcha.jpg')
```
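The decoding step can be tried offline without touching the API. The sketch below assumes a JSON body with an `img_base64` field, as in the snippet above; the helper name `save_captcha` and the fake payload are illustrative only.

```python
import base64
import json
import os
import tempfile


def save_captcha(json_text: str, path: str) -> int:
    """Decode the `img_base64` field of a JSON body and write it to `path`."""
    payload = json.loads(json_text)
    # Drop real newlines as well as literal backslash-n sequences before decoding.
    encoded = payload['img_base64'].replace('\n', '').replace('\\n', '')
    data = base64.b64decode(encoded)
    with open(path, 'wb') as f:
        f.write(data)
    return len(data)


# Fake body standing in for the PUT response; this is not real captcha data.
fake_body = json.dumps({'img_base64': base64.b64encode(b'not-a-real-image').decode()})
target = os.path.join(tempfile.gettempdir(), 'captcha_demo.bin')
print(save_captcha(fake_body, target), 'bytes written to', target)
```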
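There are actually two APIs: one asks you to identify upside-down Chinese characters, the other is an ordinary English captcha. Either will do, and the code implements both. For the Chinese captcha, the click coordinates are collected with plt and then converted to JSON. (Incidentally, the captcha can be avoided by requesting the login page again; if you need fully automatic login, you could adapt the code to try that.)
One last point: if there is a captcha, its parameter must first be POSTed to the captcha API and then POSTed again to the login API together with the other parameters.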
```python
if lang == 'cn':
    import matplotlib.pyplot as plt
    plt.imshow(img)
    print('Click every upside-down character, then press Enter to submit')
    points = plt.ginput(7)
    capt = json.dumps({'img_size': [200, 44],
                       'input_points': [[i[0]/2, i[1]/2] for i in points]})
else:
    img.show()
    capt = input('Enter the captcha shown in the image: ')
# The captcha must first be POSTed to the captcha API
self.session.post(api, data={'input_text': capt}, headers=headers)
return capt
```
<img src="https://github.com/zkqiang/Zhihu-Login/blob/master/docs/5.jpg" width=600 align=center alt="Exactly the same parameters as a normal login sends">
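To make the coordinate conversion for the Chinese captcha concrete, here is a small sketch that turns a list of click points into the JSON string. The function name `build_cn_captcha` and the sample clicks are hypothetical; the halving of the coordinates and the `[200, 44]` size follow the snippet above, presumably because the image is displayed at twice its nominal size.

```python
import json


def build_cn_captcha(points, scale=2, img_size=(200, 44)):
    """Convert (x, y) click positions into the JSON payload for the captcha."""
    return json.dumps({
        'img_size': list(img_size),
        'input_points': [[x / scale, y / scale] for x, y in points],
    })


# Made-up click positions, in the form plt.ginput returns them.
clicks = [(40.0, 30.0), (120.0, 52.0)]
print(build_cn_captcha(clicks))
```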
Then fill in the `username` and `password` values; all other fields keep their fixed values.
```python
self.login_data.update({
    'username': self.username,
    'password': self.password,
    'lang': captcha_lang
})
timestamp = int(time.time()*1000)
self.login_data.update({
    'captcha': self._get_captcha(self.login_data['lang']),
    'timestamp': timestamp,
    'signature': self._get_signature(timestamp)
})
```
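## Encrypting the Form-Data
Zhihu now requires the Form-Data to be encrypted before it is POSTed, so the encryption has to be dealt with too. The JS we can see is obfuscated, however, and working out how the encryption is implemented from it would take a great deal of effort.
So I adopted the approach of fellow Zhihu user sergiojune, who calls the JS from Python through `pyexecjs`: copy the obfuscated code over in full and make a few small changes.
See his original post for details: https://zhuanlan.zhihu.com/p/57375111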
```python
def _encrypt(form_data: dict):
    # Run the copied (obfuscated) JS and call its encryption function Q
    with open('./encrypt.js') as f:
        js = execjs.compile(f.read())
        return js.call('Q', urlencode(form_data))
```
Thanks also to him for sharing some of the pitfalls, which would otherwise have been hard to work around.
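The input handed to the JS function is just the URL-encoded form string. The sketch below shows only that plaintext step, without `pyexecjs` or `encrypt.js`, so it can be run anywhere; the helper name `to_plaintext` and the field values are placeholders.

```python
from urllib.parse import urlencode


def to_plaintext(form_data: dict) -> str:
    """Serialise the form fields into the string that is passed to the JS."""
    return urlencode(form_data)


# Placeholder values; a real form also carries password, captcha, signature, etc.
form_data = {
    'grant_type': 'password',
    'source': 'com.zhihu.web',
    'username': '+8613800000000',
    'timestamp': 1578386954000,
}
print(to_plaintext(form_data))
```

Note that `urlencode` escapes the `+` in the phone number as `%2B`, so the encrypted payload is built from the escaped form.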
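## Saving Cookies
Finally, implement a method that checks the login state. If requesting the login page results in a redirect, the login has succeeded, and the Cookies are saved at that point (`session.cookies` is initialised as an `LWPCookieJar` object, which is why it has a `save` method). The next login can then read the Cookies file directly.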
```python
def check_login(self):
    resp = self.session.get(self.login_url, allow_redirects=False)
    if resp.status_code == 302:
        self.session.cookies.save()
        return True
    return False
```
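For readers unfamiliar with `LWPCookieJar`, here is a self-contained sketch of a save and load round trip using only the standard library. The helper names, the cookie name and the `.example.com` domain are made up for the demonstration. One detail worth knowing: cookies without an expiry are marked as discardable and are skipped by `save()` unless `ignore_discard=True` is passed.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name: str, value: str, domain: str = '.example.com') -> Cookie:
    """Build a session cookie (no expiry) for demonstration purposes."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def roundtrip(path: str) -> list:
    """Save one cookie to `path`, load it into a fresh jar, return its contents."""
    jar = LWPCookieJar(filename=path)
    jar.set_cookie(make_cookie('demo_token', 'abc'))
    # Session cookies are skipped on save unless ignore_discard=True.
    jar.save(ignore_discard=True)

    restored = LWPCookieJar(filename=path)
    restored.load(ignore_discard=True)
    return [(c.name, c.value) for c in restored]


demo_path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
print(roundtrip(demo_path))
```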
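## Full code
https://github.com/zkqiang/Zhihu-Login/blob/master/zhihu_login.py
## Requirements
* Python 3
* requests
* matplotlib
* pillow
## WeChat official account
![pic](https://github.com/zkqiang/Zhihu-Login/blob/master/docs/wx.jpg)
I have started a WeChat official account: 面向人生编程 (Life-Oriented Programming)
A programming mindset should not live only in code; it should accompany you through life. The account therefore covers more than technology, including products, the internet industry, economics and other broad topics, so non-programmers are welcome to follow too.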
#LWP-Cookies-2.0
Set-Cookie3: _xsrf=EiQoq4fwTOyBpSKxnMzbuOdDitSQt39E; path="/"; domain=".zhihu.com"; path_spec; expires="2022-06-25 08:49:14Z"; version=0
Set-Cookie3: _zap="2e65edfb-d5cb-4af1-ac23-ac1d4afb7c13"; path="/"; domain=".zhihu.com"; path_spec; domain_dot; expires="2022-01-06 08:50:09Z"; version=0
Set-Cookie3: capsion_ticket="\"2|1:0|10:1578386954|14:capsion_ticket|44:OWM0ZTcyN2I5MjYzNGJmZDhhZjBhODU5ZWJhYzY0Yzg=|c9f22e0ec3638f8fa1ae386ab25715c5cffcbfa3385a7dd40bb9c40d65102e3d\""; path="/"; domain=".zhihu.com"; path_spec; expires="2020-02-06 08:49:14Z"; httponly=None; version=0
Set-Cookie3: z_c0="\"2|1:0|10:1578387010|4:z_c0|80:MS4xd0JwakRnQUFBQUFtQUFBQVlBSlZUVUtVQVZfYjhyODhuZDExWUZ2WXl4TE1aenRLUlQ1ZkxnPT0=|ab175e0ed514849acbb4f43c1a08b29801f137d899ee687e884b5ed4d74b37b7\""; path="/"; domain=".zhihu.com"; path_spec; secure; expires="2020-07-05 08:50:10Z"; httponly=None; version=0
docs/0.jpg

50.2 KB

docs/1.jpg

60.1 KB

docs/5.jpg

13.3 KB

docs/wx.jpg

45.4 KB

Pillow >= 5.0.0
matplotlib >= 2.1.2
requests >= 2.18.4
pyexecjs >= 1.5.1
# -*- coding: utf-8 -*-
__author__ = 'zkqiang'
__zhihu__ = 'https://www.zhihu.com/people/z-kqiang'
__github__ = 'https://github.com/zkqiang/Zhihu-Login'

import base64
import hashlib
import hmac
import json
import re
import threading
import time
from http import cookiejar
from urllib.parse import urlencode

import brotli
import execjs
import requests
from PIL import Image


class ZhihuAccount(object):

    def __init__(self, username: str = None, password: str = None):
        self.username = username
        self.password = password

        self.login_data = {
            'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
            'grant_type': 'password',
            'source': 'com.zhihu.web',
            'username': '',
            'password': '',
            'lang': 'en',
            'ref_source': 'homepage',
            'utm_source': ''
        }
        self.session = requests.session()
        self.session.headers = {
            'accept-encoding': 'gzip, deflate, br',
            'Host': 'www.zhihu.com',
            'Referer': 'https://www.zhihu.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
        }
        self.session.cookies = cookiejar.LWPCookieJar(filename='./cookies.txt')

    def login(self, captcha_lang: str = 'en', load_cookies: bool = True):
        """
        Simulate a Zhihu login
        :param captcha_lang: captcha type, 'en' or 'cn'
        :param load_cookies: whether to read the Cookies saved last time
        :return: bool
        If the Chinese captcha cannot be clicked when running under PyCharm,
        untick Settings / Tools / Python Scientific / Show Plots in Toolwindow
        """
        if load_cookies and self.load_cookies():
            print('Loaded Cookies file')
            if self.check_login():
                print('Login succeeded')
                return True
            print('Cookies have expired')

        self._check_user_pass()
        self.login_data.update({
            'username': self.username,
            'password': self.password,
            'lang': captcha_lang
        })

        timestamp = int(time.time() * 1000)
        self.login_data.update({
            'captcha': self._get_captcha(self.login_data['lang']),
            'timestamp': timestamp,
            'signature': self._get_signature(timestamp)
        })

        headers = self.session.headers.copy()
        headers.update({
            'content-type': 'application/x-www-form-urlencoded',
            'x-zse-83': '3_1.1',
            'x-xsrftoken': self._get_xsrf()
        })
        data = self._encrypt(self.login_data)
        login_api = 'https://www.zhihu.com/api/v3/oauth/sign_in'
        resp = self.session.post(login_api, data=data, headers=headers)
        if 'error' in resp.text:
            print(json.loads(resp.text)['error'])
        if self.check_login():
            print('Login succeeded')
            return True
        print('Login failed')
        return False

    def load_cookies(self):
        """
        Read the Cookies file and load it into the Session
        :return: bool
        """
        try:
            self.session.cookies.load(ignore_discard=True)
            return True
        except FileNotFoundError:
            return False

    def check_login(self):
        """
        Check the login state: a redirect when requesting the login page means
        we are logged in. On success, save the current Cookies
        :return: bool
        """
        login_url = 'https://www.zhihu.com/signup'
        resp = self.session.get(login_url, allow_redirects=False)
        if resp.status_code == 302:
            self.session.cookies.save()
            return True
        return False

    def _get_xsrf(self):
        """
        Get the xsrf token by requesting the home page
        :return: str
        """
        self.session.get('https://www.zhihu.com/', allow_redirects=False)
        for c in self.session.cookies:
            if c.name == '_xsrf':
                return c.value
        raise AssertionError('Failed to get xsrf')

    def _get_captcha(self, lang: str):
        """
        Request the captcha API; this must be done once whether or not a captcha is needed.
        If a captcha is required, the API returns the image as base64.
        The captcha type follows the lang parameter and needs manual input
        :param lang: language of the returned captcha (en/cn)
        :return: the captcha POST parameter
        """
        if lang == 'cn':
            api = 'https://www.zhihu.com/api/v3/oauth/captcha?lang=cn'
        else:
            api = 'https://www.zhihu.com/api/v3/oauth/captcha?lang=en'
        resp = self.session.get(api)
        show_captcha = re.search(r'true', resp.text)

        if show_captcha:
            put_resp = self.session.put(api)
            json_data = json.loads(put_resp.text)
            img_base64 = json_data['img_base64'].replace(r'\n', '')
            with open('./captcha.jpg', 'wb') as f:
                f.write(base64.b64decode(img_base64))
            img = Image.open('./captcha.jpg')
            if lang == 'cn':
                import matplotlib.pyplot as plt
                plt.imshow(img)
                print('Click every upside-down character, then press Enter in the terminal to submit')
                points = plt.ginput(7)
                capt = json.dumps({'img_size': [200, 44],
                                   'input_points': [[i[0] / 2, i[1] / 2] for i in points]})
            else:
                # Show the image in a background thread so input() is not blocked
                img_thread = threading.Thread(target=img.show, daemon=True)
                img_thread.start()
                capt = input('Enter the captcha shown in the image: ')
            # The captcha must first be POSTed to the captcha API
            self.session.post(api, data={'input_text': capt})
            return capt
        return ''

    def _get_signature(self, timestamp):
        """
        Compute and return the signature with the HMAC algorithm;
        the message is just a few fixed strings plus the timestamp
        :param timestamp: timestamp (int or str)
        :return: signature
        """
        ha = hmac.new(b'd1b964811afb40118a12068ff74a12f4', digestmod=hashlib.sha1)
        grant_type = self.login_data['grant_type']
        client_id = self.login_data['client_id']
        source = self.login_data['source']
        ha.update(bytes((grant_type + client_id + source + str(timestamp)), 'utf-8'))
        return ha.hexdigest()

    def _check_user_pass(self):
        """
        Check that the username and password have been provided; prompt for them if not
        """
        if not self.username:
            self.username = input('Enter your phone number: ')
        if self.username.isdigit() and '+86' not in self.username:
            self.username = '+86' + self.username

        if not self.password:
            self.password = input('Enter your password: ')

    @staticmethod
    def _encrypt(form_data: dict):
        with open('./encrypt.js') as f:
            js = execjs.compile(f.read())
            return js.call('Q', urlencode(form_data))

    # Article list of a Zhihu member
    def test_member_article(self):
        member_article_url = "https://www.zhihu.com/api/v4/members/li-pei-rong-96/articles?include=data%5B*%5D.comment_count%2Csuggest_edit%2Cis_normal%2Cthumbnail_extra_info%2Cthumbnail%2Ccan_comment%2Ccomment_permission%2Cadmin_closed_comment%2Ccontent%2Cvoteup_count%2Ccreated%2Cupdated%2Cupvoted_followees%2Cvoting%2Creview_info%2Cis_labeled%2Clabel_info%3Bdata%5B*%5D.author.badge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=40&limit=20&sort_by=created"
        resp = self.session.get(member_article_url, allow_redirects=False)
        print(10 * "*")
        # The session asks for 'br' encoding, so the body is decompressed by hand here
        raw_content = brotli.decompress(resp.content)
        print(type(raw_content))
        content_dict = json.loads(str(raw_content, encoding="utf-8"))
        for item in content_dict["data"]:
            print(item["title"])
            print(item["content"])
            print(50 * "*")

    # Zhihu keyword search
    def zhihu_query_by_word(self, query_word):
        # Stub: the URL is hard-coded and query_word is not used yet
        query_by_word_url = "https://www.zhihu.com/api/v4/search_v3?t=general&q=%E8%B6%85%E5%A3%B0%E5%88%80&correction=1&offset=0&limit=20&lc_idx=62&show_all_topics=0&search_hash_id=1dbb1e923a17f147356177932d1236e1&vertical_info=0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C1"
        return


if __name__ == '__main__':
    account = ZhihuAccount('', '')
    account.login(captcha_lang='en', load_cookies=True)
    account.test_member_article()