Thread pool scraping returns the wrong number of pages? | python | Python Tech Forum

from concurrent.futures import ThreadPoolExecutor
import csv
from bs4 import BeautifulSoup
import requests

f = open('data.csv', mode='w')
csvwriter = csv.writer(f)

def download_one_page(url):
    resp = requests.get(url)
    res = BeautifulSoup(resp.text, 'html.parser')
    div = res.find('div', class_="conter_con")
    ulss = div.find_all('ul')
    for ull in ulss:
        # each <ul> is one news item: grab its title and image URL
        title = ull.find('p', class_="title").text
        img_url = ull.find('img').get('src')
        csvwriter.writerow([title, img_url])
    print(url, 'extraction complete')

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=10) as t:
        for i in range(1, 16):
            t.submit(download_one_page, f"http://www.xinfadi.com.cn/newscenter.html?current={i}")

print('ok')

Can someone tell me why I can only scrape 4 pages of data?


Replies: 4

It looks like the other pages failed to scrape. You can check the scraping results to tell whether each request actually succeeded.
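For example, a minimal standalone check (a sketch, assuming the same URL pattern as the code above and that the conter_con div is the container the data lives in):

import requests
from bs4 import BeautifulSoup

for i in range(1, 16):
    url = f"http://www.xinfadi.com.cn/newscenter.html?current={i}"
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # a page only counts as scraped if the request succeeded AND the
    # expected container is actually present in the returned HTML
    has_content = resp.ok and soup.find('div', class_="conter_con") is not None
    print(url, resp.status_code, 'ok' if has_content else 'FAILED')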

4 months ago
(OP) 4 months ago
jason990420

The submit() method does not block while the task is executing; it returns immediately with a Future object that provides a handle on the task.

IMO, the script ends before all your tasks are done!

Not sure if it works well, but try the following code:

from concurrent.futures import ThreadPoolExecutor
import csv
from bs4 import BeautifulSoup
import requests

def download_one_page(i):
    url = f"http://www.xinfadi.com.cn/newscenter.html?current={i}"
    resp = requests.get(url)
    res = BeautifulSoup(resp.text, 'html.parser')
    div = res.find('div', class_="conter_con")
    if div:
        ulss = div.find_all('ul')
        for ull in ulss:
            title = ull.find('p', class_="title").text
            img_url = ull.find('img').get('src')
            csvwriter.writerow([title, img_url])
        print(url, 'extraction complete')
    else:
        print(url, 'no conter_con class found !')

if __name__ == '__main__':
    f = open('data.csv', mode='w', newline='')  # newline='' keeps csv from writing blank rows on Windows
    csvwriter = csv.writer(f)
    with ThreadPoolExecutor(max_workers=5) as t:
        all_task = [t.submit(download_one_page, i) for i in range(1, 16)]
        results = [task.result() for task in all_task]    # block until every task is done
    f.close()

print('ok')
http://www.xinfadi.com.cn/newscenter.html?current=3 no conter_con class found !
http://www.xinfadi.com.cn/newscenter.html?current=5 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=1 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=4 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=2 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=7 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=8 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=9 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=10 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=11 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=13 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=14 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=12 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=15 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=6 extraction complete
ok
4 months ago
(OP) 4 months ago
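Worth adding: Future.result() re-raises any exception raised inside the worker thread, so the results list in the code above also surfaces per-page failures instead of swallowing them. concurrent.futures.as_completed() does the same while yielding each task as it finishes, which makes per-page error reporting easy. A minimal sketch, assuming the same URL pattern (fetch is a hypothetical helper, not from the thread):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(i):
    url = f"http://www.xinfadi.com.cn/newscenter.html?current={i}"
    resp = requests.get(url)
    resp.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    return url

with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(fetch, i): i for i in range(1, 16)}
    for fut in as_completed(futures):  # yields each Future as it finishes
        page = futures[fut]
        err = fut.exception()  # non-None if fetch() raised
        print(f"page {page}:", 'ok' if err is None else f'failed ({err!r})')

With raise_for_status() in the worker, an HTTP error on any page shows up as a reported failure instead of silently producing an incomplete CSV.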
