Python Web Scraping in Practice: A Step-by-Step Guide to Scraping Comics from a Website (Source Code Included)

Posted by 不想敲代码的程序员 on 2021-11-2 17:17:21


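Development tools

Python version: 3.6.4
Modules used:
requests;
re;
shutil;
plus a few other modules that ship with Python.

Environment setup

Install Python and add it to your PATH, then install the third-party modules with pip. re and shutil are part of the standard library; requests and jsonpath (both used in the code further down) need to be installed.

Approach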

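A comic is really just one image after another, so the first step is to find where the image links live. The goal of this post is to scrape whichever comic you want to read, so search for any comic (神印王座 is the example here), open its detail page, and view any chapter. On the reading page, the page source does not contain the data we need, so open the browser's developer tools and capture the network traffic. That is where the image links finally turn up.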
https://p6.toutiaoimg.com/large/pgc-image/f7fe2d7ad2984fd3bbfa47964e1249bd
https://p3.toutiaoimg.com/large/pgc-image/4aef163177b14319a2bdeb977f6d91a7
https://p5.toutiaoimg.com/large/pgc-image/a26e019f790e4f83b0cbe8e70aac6ecd
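With the image links located, the next step is to get them out of that data packet: request the packet's URL and extract the image links from the response. Comparing the packet URLs for several pages (listed below) shows that chapter_newid changes on every page turn, while comic_id is the unique identifier of a comic.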
https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id=5323&chapter_newid=1006&isWebp=1&quality=middle
https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id=5323&chapter_newid=2003&isWebp=1&quality=middle
https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id=5323&chapter_newid=3004&isWebp=1&quality=middle
Next, work out where these two parameters come from. Go to the home page, search for 神印王座, and view the page source: the URL of the comic's detail page can be found there. When I tried to extract it with regular expressions and XPath, though, it turned out to be very difficult. Many of the HTML tags in the source are identical, and the source contains more than one comic.
https://p9.toutiaoimg.com/large/pgc-image/1dcddb96a1684350973cb5ae41f5531d
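To make the two varying parameters in the getchapterinfov2 URLs listed earlier concrete, here is a minimal sketch that rebuilds such a URL from comic_id and chapter_newid. The function name build_chapter_url and the constant API_BASE are made up for illustration; the parameter names and fixed values are copied from the captured URLs.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint taken from the captured URLs; the constant name is illustrative.
API_BASE = "https://www.kanman.com/api/getchapterinfov2"


def build_chapter_url(comic_id, chapter_newid):
    """Assemble a chapter-info URL; only the two arguments vary between requests."""
    params = {
        "product_id": 1,
        "productname": "kmh",
        "platformname": "pc",
        "comic_id": comic_id,
        "chapter_newid": chapter_newid,
        "isWebp": 1,
        "quality": "middle",
    }
    # urlencode keeps the dict's insertion order, so the result matches the captured URL
    return f"{API_BASE}?{urlencode(params)}"


url = build_chapter_url(5323, 1006)
query = parse_qs(urlparse(url).query)
print(url)
print(query["comic_id"], query["chapter_newid"])
```

Keeping the fixed parameters in one place means a later change (for example a different quality value) only has to be made once.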
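Then I tried searching for other comics and found that they were not in the source at all. That is when I realized I had fallen into a trap: the source I had been reading was the source of the site's home page. Careless of me! No matter, though. If it is not in the source, we capture the traffic instead.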
https://p26.toutiaoimg.com/large/pgc-image/43df812c8dbf4a4c9da3328d1986eed6
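Open the developer tools, go to the XHR filter in the Network tab, and search for 神印王座. The first search captures one data packet, but it is shown in red: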
https://p3.toutiaoimg.com/large/pgc-image/f452b28bff9b4e33a38b532bdee43b93
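It does contain what we need, however. Because it is flagged in red, the data cannot be previewed in the developer tools, so you have to open the packet itself: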
https://p9.toutiaoimg.com/large/pgc-image/25d053d31f824bed9f99c8e13e05831a
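To capture a packet that is not flagged in red, click the search box again and it will load. Refreshing the page or clicking the search button again does not produce it.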
https://p6.toutiaoimg.com/large/pgc-image/b6d0f19f4078446fb002ab1ed6e4b3c1
With the packet in hand, we can find the comic's unique identifier, comic_id, and simply extract it from the response:
https://p5.toutiaoimg.com/large/pgc-image/ca38675e1f874ad481ccf23b4b90fcf9
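As a rough sketch of that extraction step: the search endpoint and the 'data' and 'comic_id' keys are taken from the code later in this post, while the helper names, the comic_name field and the sample payload are invented purely for illustration.

```python
from urllib.parse import quote

# Search endpoint as used later in the post; the constant name is illustrative.
SEARCH_API = "https://www.kanman.com/api/getsortlist/"


def build_search_url(keyword):
    """Percent-encode the keyword so non-ASCII titles are safe in the query string."""
    return f"{SEARCH_API}?search_key={quote(keyword)}"


def extract_comic_ids(payload):
    """Collect comic_id from every search hit listed under 'data'."""
    return [item["comic_id"] for item in payload.get("data", []) if "comic_id" in item]


# Made-up payload with the same top-level shape the post's code relies on.
sample_payload = {"data": [{"comic_id": 5323, "comic_name": "神印王座"}]}

print(build_search_url("神印王座"))
print(extract_comic_ids(sample_payload))
```

Note that a search can return several hits, which is why the result is a list rather than a single id.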
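With comic_id found, the next target is chapter_newid. The way chapter_newid changes differs from comic to comic. If the first comic you search for happens to be 斗罗大陆, though, you will see that its chapter_newid simply increments.
So how do we find chapter_newid? Go to the comic's detail page. We already know that the chapter_newid of the first chapter of 神印王座 is 1006, so search for 1006 in the developer tools. It turns up in the source of the detail page: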
https://p5.toutiaoimg.com/large/pgc-image/1505aeccd11b464497b71f19a987c841
So the first chapter_newid is loaded statically with the detail page and can be extracted from that page's source. The detail page URL is https://www.kanman.com/ followed by the comic_id:
https://p26.toutiaoimg.com/large/pgc-image/d05fba656db24d0c91711f480048e9dc
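A minimal sketch of pulling that first id out of the detail page. The regular expression mirrors the one used in the code later in this post; the helper names and the sample HTML are made up, and the real page may embed the value differently.

```python
import re

# Same idea as the pattern in the post's get_comic, with the braces escaped explicitly.
CHAPTER_ID_PATTERN = re.compile(r'\{"chapter_id":"(.*?)"\}')


def detail_url(comic_id):
    """The detail page URL is the site root followed by the comic_id."""
    return f"https://www.kanman.com/{comic_id}/"


def first_chapter_id(detail_html):
    """Return the first chapter id found in the page source, or None if absent."""
    match = CHAPTER_ID_PATTERN.search(detail_html)
    return match.group(1) if match else None


# Invented stand-in for the detail page source.
sample_html = '<script>window.comicInfo = {"chapter_id":"1006"};</script>'

print(detail_url(5323))
print(first_chapter_id(sample_html))
```

Returning None instead of raising makes it easy for the caller to skip a comic whose page does not match.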
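That only gives us the chapter_newid of the first chapter. Where do the others come from? After some digging, I found that each page's chapter_newid is obtained from the response for the page before it: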
https://p9.toutiaoimg.com/large/pgc-image/62c70343cccb415c884ba5e02074767a
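The "each page points to the next" pattern can be sketched on its own, independent of the real API. Everything here is hypothetical except the three ids, which come from the URLs captured earlier: walk_chapters, the "name" and "next_id" fields, and the in-memory FAKE_SITE stand in for real requests.

```python
def walk_chapters(first_id, fetch_chapter, max_chapters=1000):
    """Yield chapter records, following each record's pointer to the next chapter."""
    chapter_id = first_id
    seen = set()
    # Stop when there is no next id, when an id repeats, or at the safety limit
    while chapter_id is not None and chapter_id not in seen and len(seen) < max_chapters:
        seen.add(chapter_id)
        record = fetch_chapter(chapter_id)
        yield record
        chapter_id = record.get("next_id")


# In-memory stand-in for the site: each record names its successor.
FAKE_SITE = {
    1006: {"name": "Chapter 1", "next_id": 2003},
    2003: {"name": "Chapter 2", "next_id": 3004},
    3004: {"name": "Chapter 3", "next_id": None},
}

names = [record["name"] for record in walk_chapters(1006, FAKE_SITE.__getitem__)]
print(names)
```

The seen set and max_chapters guard are additions of this sketch, so that a response that points back to an earlier chapter cannot loop forever.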
Code implementation
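The functions in this section all call get_response, which the post never shows. Here is a minimal sketch of what such a helper might look like, assuming it is a thin wrapper around requests.get. The User-Agent string and the timeout are placeholders, not values from the original.

```python
import requests

# Placeholder header; the original post does not show which headers it sent.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; comic-scraper-example)"}


def get_response(url, timeout=10):
    """Fetch a URL and return the requests.Response, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    return response
```

Returning the Response object itself lets callers pick .json(), .text or .content, which is how the functions in this section use it.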

Build the function that extracts comic_id and chapter_id:
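import json
import os
import re
import jsonpath

def get_comic(url):
    # get_response is the request helper (not shown in the post)
    data = get_response(url).json()['data']
    for i in data:
        comic_id = i['comic_id']
        chapter_newid_url = f'https://www.kanman.com/{comic_id}/'
        chapter_newid_html = get_response(chapter_newid_url).text
        # findall returns a list; the first match is assumed to be the first chapter
        chapter_ids = re.findall('{"chapter_id":"(.*?)"}', chapter_newid_html)
        if chapter_ids:
            data_html(comic_id, chapter_ids[0])

Next comes the key code. If you have scraped Weibo comment data before, you will recognize the pattern: the value used to turn the page has to be taken from the previous page: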
def data_html(comic_id, chapter_id):
    api = ('https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh'
           '&platformname=pc&comic_id={}&chapter_newid={}&isWebp=1&quality=middle')
    try:
        a = 1
        chapter_newid = chapter_id  # the first request uses the id from the detail page
        while True:  # keep following chapter_newid from one page to the next
            comic_htmls = get_response(api.format(comic_id, chapter_newid)).text
            comic_html_jsons = json.loads(comic_htmls)
            current_chapter = jsonpath.jsonpath(comic_html_jsons, '$..current_chapter')
            for img_and_name in current_chapter:
                # jsonpath returns a list of matches, so take the first one
                image_url = jsonpath.jsonpath(img_and_name, '$..chapter_img_list')[0]  # image URLs
                # chapter_name contains spaces, so strip them
                chapter_name = jsonpath.jsonpath(img_and_name, '$..chapter_name')[0].strip()
                save(image_url, chapter_name)
            # The list indexes seem to have been lost from the post. Using 1 and then 2 is
            # an assumption based on the original "+1" comment below.
            if a == 1:
                chapter_newid = jsonpath.jsonpath(comic_html_jsons, '$..chapter_newid')[1]
            else:  # from the second URL onward, the extraction index is +1
                chapter_newid = jsonpath.jsonpath(comic_html_jsons, '$..chapter_newid')[2]
            a += 1
    except (IndexError, TypeError):
        # no next chapter_newid (jsonpath returns False when nothing matches): stop
        pass

Saving the data:
def save(image_url, chapter_name):
    for link_url in image_url:
        # image name: the "<digits>.jpg" part of the URL that precedes "-kmh"
        image_name = ''.join(re.findall(r'/(\d+\.jpg)-kmh', str(link_url)))
        image_path = data_path + chapter_name
        if not os.path.exists(image_path):  # create a folder named after the chapter title
            os.mkdir(image_path)
        image_content = get_response(link_url).content
        filename = '{}/{}'.format(image_path, image_name)
        with open(filename, mode='wb') as f:
            f.write(image_content)
            print(image_name)
    # Optional: get_img stitches a chapter's images together. It is not shown in the
    # post, so the call is left commented out here.
    # get_img(chapter_name)

Console entry point:
if __name__ == '__main__':
    key = input('Enter the comic you want to download: ')
    data_path = r'D:/数据小刀/爬虫④/漫画/{}/'.format(key)
    if not os.path.exists(data_path):  # create a folder named after the comic the user entered
        os.makedirs(data_path)  # makedirs also creates any missing parent folders
    url = f'https://www.kanman.com/api/getsortlist/?search_key={key}'  # URL with the unneeded parameters removed
    get_comic(url)

The saved data
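The layout on disk follows from save: one folder per chapter under the comic's folder, with each image named after the number in its URL. Below is a minimal sketch of that path derivation. The function name image_path_for, the fallback naming and the example URL are all made up; only the "<digits>.jpg-kmh" pattern comes from the post's code.

```python
import os
import re

# Same pattern as in save, with the dot escaped so it matches a literal "."
IMAGE_NAME_PATTERN = re.compile(r"/(\d+\.jpg)-kmh")


def image_path_for(data_path, chapter_name, image_url, fallback_index=0):
    """Return the file path an image would be saved to."""
    match = IMAGE_NAME_PATTERN.search(image_url)
    # Fall back to a numbered name so an unexpected URL never yields an empty filename
    image_name = match.group(1) if match else f"{fallback_index}.jpg"
    return os.path.join(data_path, chapter_name.strip(), image_name)


# Invented URL in the shape the pattern expects.
example_url = "https://example.com/comic/chapter/12.jpg-kmh.middle.webp"
print(image_path_for("comics", " Chapter 1 ", example_url))
```

The fallback name is an addition of this sketch. In the post's save function, a URL that does not match the pattern produces an empty image name.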

If you enjoyed this, remember to like and follow!
