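I Scraped 女神视界 with Python: the Crawler's Road Never Ends (Source Code Included)
I've noticed that plenty of young women on Douyin go viral just by posting a dance video. Is everyone really there for the dancing? No, they are there for the looks and the figure, and anyone who ended up on this article is no different. I do things a little differently, though: scrolling through videos one by one is too much hassle, so I scrape them all and watch as many as I like. Here are two random samples first.
https://p6.toutiaoimg.com/large/pgc-image/3fb9750f5e0e44659c4d3f88412328a7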
https://p6.toutiaoimg.com/large/pgc-image/d94f6af09a2b4d0b966861a73a5d1614
Scraping target
Target site: 女神世界
https://p6.toutiaoimg.com/large/pgc-image/1cb47988aaa446418ab737b07439c4b8
https://p26.toutiaoimg.com/large/pgc-image/f904b7c966b74dbf92c2f1802d8403e7
Results preview
https://p26.toutiaoimg.com/large/pgc-image/87c82ac7bbf748bc927b5f72925d8a64
https://p3.toutiaoimg.com/large/pgc-image/8628ff4bca984c42bac4b3049ad53660
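Tools used
Environment: Python 3.7. IDE: PyCharm. Libraries: requests and pyquery (third-party), re (standard library).
Crawling approach:
[*] What we download is the video data itself (raw bytes, displayed as hexadecimal).
[*] The list page does not contain the video address; you have to open the detail page, so scraping starts from the video playback page.
Press F12 to open the developer console: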
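The two steps above (list page first, then each detail page) can be sketched as a small pipeline. This is only an illustration under stated assumptions: `collect_video_urls` and its callback parameters are hypothetical names, the example.com pages are fake, and the use of `urljoin` assumes some links on the list page might be relative.

```python
from urllib.parse import urljoin

def collect_video_urls(list_url, fetch, find_detail_links, find_video_url):
    """Walk list page -> detail pages -> video addresses."""
    list_html = fetch(list_url)
    results = []
    for href in find_detail_links(list_html):
        # urljoin leaves absolute links unchanged and resolves relative ones
        detail_url = urljoin(list_url, href)
        video_url = find_video_url(fetch(detail_url))
        if video_url:
            results.append((detail_url, video_url))
    return results

# Offline demonstration with fake pages (hypothetical URLs and contents).
pages = {
    'https://example.com/list.html': 'LIST',
    'https://example.com/1.html': 'video=https://example.com/1.mp4',
    'https://example.com/2.html': 'nothing here',
}
found = collect_video_urls(
    'https://example.com/list.html',
    fetch=pages.get,
    find_detail_links=lambda html: ['/1.html', 'https://example.com/2.html'],
    find_video_url=lambda html: html.split('video=')[1] if 'video=' in html else None,
)
print(found)
```

Passing the fetch and parse steps in as functions keeps the flow testable without touching the network; in a real run they would be replaced by an HTTP request and the page parsers.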
https://p6.toutiaoimg.com/large/pgc-image/38719ddd29fd4038b6479430dc04d660
https://p9.toutiaoimg.com/large/pgc-image/f03a13d115a14a2c995146ff2b2c65e5
No rush. First find the video address, then search for it to see which response contains it:
https://p9.toutiaoimg.com/large/pgc-image/a9cb2551347846118e49d5725ce0fe06
https://p5.toutiaoimg.com/large/pgc-image/e0541b2570a445c190b7b8895a638e4e
https://p9.toutiaoimg.com/large/pgc-image/42756d15b93741079788494609cac718
https://p26.toutiaoimg.com/large/pgc-image/9689c26003a0432c8a6571acbce249b6
Locating it shows that the address is part of the data returned by the static page:
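Because the address sits in the static HTML, a regular expression is enough to pull it out. The snippet below is a minimal sketch on a made-up fragment: `sample_html` and the example.com address are assumptions for illustration, not the real page source, and `extract_video_url` is a hypothetical helper name. The pattern is the same `url: "..."` form that this post's code searches for.

```python
import re

# Hypothetical fragment standing in for the detail page source (not the real page).
sample_html = '''
<script>
var player = new Player({
    url: "https://example.com/videos/demo.mp4",
    autoplay: false
});
</script>
'''

def extract_video_url(html):
    """Return the first address found in a `url: "..."` entry, or None if absent."""
    matches = re.findall(r'url: "(.*?)",', html)
    # re.findall always returns a list, so pick the first element explicitly.
    return matches[0] if matches else None

print(extract_video_url(sample_html))
print(extract_video_url('<html>no player here</html>'))
```

Returning None when nothing matches lets the caller skip a page instead of crashing on an empty list.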
https://p6.toutiaoimg.com/large/pgc-image/50533b1d6d3a4f36baf3afcfa50fac2a
https://p5.toutiaoimg.com/large/pgc-image/9065e058daf547d4856bf6df6aa0fd0a
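Here is the code:
import re
import requests

def Tools(url):
    # Helper function that wraps the HTTP request
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36 Edg/93.0.961.52'
    }
    response = requests.get(url, headers=headers)
    return response

url = 'https://www.520mmtv.com/9614.html'
response = Tools(url).text
# Extract the video address with a regular expression.
# re.findall returns a list, so take the first match.
video_url = re.findall(r'url: "(.*?)",', response)[0]
video_content = Tools(video_url).content
# The 短视频 folder must be created by hand next to the script before running.
# 'wb' overwrites on a rerun instead of appending to an existing file.
with open('./短视频/123.mp4', 'wb') as f:
    f.write(video_content)
# One video downloaded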
https://p3.toutiaoimg.com/large/pgc-image/a0fedcbf41d042c88871893b12d369fd
https://p26.toutiaoimg.com/large/pgc-image/147bdafde2e34a4d9f9d8ad49aad0908
https://p5.toutiaoimg.com/large/pgc-image/80dba0c753e149959c365838c1dbf354
https://p26.toutiaoimg.com/large/pgc-image/b1f21dba51ef45c3882ec62fd8263e8a
https://p3.toutiaoimg.com/large/pgc-image/027f68708b2a4e169820dc1630d84619
https://p9.toutiaoimg.com/large/pgc-image/d0d773c2226a44acbea9aeead0a893fe
def main():
    url = 'https://www.520mmtv.com/hd/rewu.html'
    response = Tools(url).text
    # Build a PyQuery object; data is extracted with CSS class and id selectors
    doc = pq(response)
    # Class selector: where the class attribute contains spaces, replace them with dots
    i_list = doc('.i_list.list_n2.cxudy-list-formatvideo a').items()
    meta_title = doc('.meta-title').items()  # titles
    for i, t in zip(i_list, meta_title):
        href = i.attr('href')
        Play(t.text(), href)
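The remark about replacing spaces with dots deserves a tiny example: an element whose class attribute holds several space-separated names is matched by chaining those names with dots. The helper `class_attr_to_selector` is a hypothetical name introduced here for illustration; it only builds the selector string and does not touch the network.

```python
def class_attr_to_selector(class_attr, descendant=''):
    """Turn a class attribute value into a compound CSS class selector."""
    names = class_attr.split()  # split on any run of whitespace
    selector = ''.join('.' + name for name in names)
    if descendant:
        # A space in a selector means "descendant of", e.g. links inside the list
        selector += ' ' + descendant
    return selector

print(class_attr_to_selector('i_list list_n2 cxudy-list-formatvideo', 'a'))
```

The resulting string can be passed straight to the PyQuery object, as `doc(...)` does in `main()`.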
https://p3.toutiaoimg.com/large/pgc-image/64396cee2b174eae92f80b5701b3a0d5
Full code:
import re

import requests
from pyquery import PyQuery as pq


def Tools(url):
    # Helper function that wraps the HTTP request
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36 Edg/93.0.961.52'
    }
    response = requests.get(url, headers=headers)
    return response


def Play(title, url):
    # url = 'https://www.520mmtv.com/9614.html'
    response = Tools(url).text
    # re.findall returns a list, so take the first match
    video_url = re.findall(r'url: "(.*?)",', response)[0]
    video_content = Tools(video_url).content
    # The 短视频 folder must exist next to the script; 'wb' overwrites on a rerun
    with open('./短视频/{}.mp4'.format(title), 'wb') as f:
        f.write(video_content)
    print('{} downloaded....'.format(title))


def main():
    url = 'https://www.520mmtv.com/hd/rewu.html'
    response = Tools(url).text
    # Build a PyQuery object; data is extracted with CSS class and id selectors
    doc = pq(response)
    # Links to the detail pages (class selector; spaces in a class attribute become dots)
    i_list = doc('.meta-title').items()
    meta_title = doc('.meta-title').items()  # titles
    for i, t in zip(i_list, meta_title):
        href = i.attr('href')
        Play(t.text(), href)


if __name__ == '__main__':
    main()
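Two practical details follow from saving files under the video title: a title may contain characters that are not valid in a file name, and the target folder has to exist. The sketch below handles both; `safe_filename` and `build_save_path` are hypothetical helper names, and the set of replaced characters is an assumption based on what Windows rejects.

```python
import os
import re

def safe_filename(title, fallback='video'):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip(' .')
    # Fall back to a fixed name if nothing usable is left
    return cleaned or fallback

def build_save_path(title, folder='短视频'):
    """Create the folder if needed and return the .mp4 path for this title."""
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, safe_filename(title) + '.mp4')

print(safe_filename('dance: part 1/2?'))
```

With this in place the folder no longer has to be created by hand, and `build_save_path(title)` could take the place of the formatted path inside `Play`.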
https://p26.toutiaoimg.com/large/pgc-image/2ed792ac37534b12bb2674f151ffe75f
Downloading was fairly slow for me because my connection is poor; with a faster connection it will go faster.
Result:
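On a slow connection it can help to write the video to disk piece by piece rather than holding the whole file in memory first. This does not make the network any faster, but it keeps memory use flat and allows progress reporting. The following is a minimal sketch: `save_in_chunks` and `download_video` are hypothetical names, and the chunk size and timeout are arbitrary assumptions.

```python
import io

def save_in_chunks(chunks, out_file):
    """Write an iterable of byte chunks to an open binary file; return bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            out_file.write(chunk)
            total += len(chunk)
    return total

def download_video(url, path, headers=None, chunk_size=1024 * 256):
    """Stream a video to disk with requests instead of loading it all at once."""
    import requests  # imported here so save_in_chunks works without it installed
    with requests.get(url, headers=headers, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(path, 'wb') as f:
            return save_in_chunks(response.iter_content(chunk_size=chunk_size), f)

# Offline demonstration with in-memory data (no network involved).
buffer = io.BytesIO()
written = save_in_chunks([b'abc', b'', b'defg'], buffer)
print(written, buffer.getvalue())
```

`download_video` relies on `stream=True` and `iter_content`, both part of the requests API, so the response body is read only as each chunk is written.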
https://p5.toutiaoimg.com/large/pgc-image/078235b673114b3892995322cd368390