Thread pool scraping returns the wrong number of pages? | python | Python Tech Forum

from concurrent.futures import ThreadPoolExecutor
import csv
from bs4 import BeautifulSoup
import requests

f = open('data.csv', mode='w')
csvwriter = csv.writer(f)

def download_one_page(url):
    resp = requests.get(url)
    res = BeautifulSoup(resp.text, 'html.parser')
    div = res.find('div', class_="conter_con")
    ulss = div.find_all('ul')
    for ull in ulss:
        # each <ul> is one news item: grab its title and image URL
        title = ull.find('p', class_="title").text
        img_url = ull.find('img').get('src')
        csvwriter.writerow([title, img_url])
    print(url, 'extraction complete')

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=10) as t:
        for i in range(1, 16):
            t.submit(download_one_page, f"http://www.xinfadi.com.cn/newscenter.html?current={i}")

print('ok')

Can someone tell me why I can only scrape 4 pages of data?


Replies: 4

It looks like the other pages failed to scrape. You can check the scraping results to tell whether each request actually succeeded.
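For example, a minimal standalone check (a sketch, assuming the same URL pattern as the code above and that the conter_con div is the container the data lives in):

import requests
from bs4 import BeautifulSoup

for i in range(1, 16):
    url = f"http://www.xinfadi.com.cn/newscenter.html?current={i}"
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # a page only counts as scraped if the request succeeded AND the
    # expected container is actually present in the returned HTML
    has_content = resp.ok and soup.find('div', class_="conter_con") is not None
    print(url, resp.status_code, 'ok' if has_content else 'FAILED')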

4 months ago
(OP) 4 months ago
jason990420

The submit() method does not block while the task is executing; it returns immediately with a Future object that provides a handle on the task.

IMO, the script ends before all your tasks are done!

Not sure if it works well, but try the following code:

from concurrent.futures import ThreadPoolExecutor
import csv
from bs4 import BeautifulSoup
import requests

def download_one_page(i):
    url = f"http://www.xinfadi.com.cn/newscenter.html?current={i}"
    resp = requests.get(url)
    res = BeautifulSoup(resp.text, 'html.parser')
    div = res.find('div', class_="conter_con")
    if div:
        ulss = div.find_all('ul')
        for ull in ulss:
            title = ull.find('p', class_="title").text
            img_url = ull.find('img').get('src')
            csvwriter.writerow([title, img_url])
        print(url, 'extraction complete')
    else:
        print(url, 'no conter_con class found !')

if __name__ == '__main__':
    f = open('data.csv', mode='w', newline='')  # newline='' keeps csv from writing blank rows on Windows
    csvwriter = csv.writer(f)
    with ThreadPoolExecutor(max_workers=5) as t:
        all_task = [t.submit(download_one_page, i) for i in range(1, 16)]
        results = [task.result() for task in all_task]    # block until every task is done
    f.close()

print('ok')
http://www.xinfadi.com.cn/newscenter.html?current=3 no conter_con class found !
http://www.xinfadi.com.cn/newscenter.html?current=5 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=1 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=4 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=2 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=7 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=8 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=9 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=10 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=11 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=13 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=14 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=12 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=15 extraction complete
http://www.xinfadi.com.cn/newscenter.html?current=6 extraction complete
ok
4 months ago
(OP) 4 months ago
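Worth adding: Future.result() re-raises any exception raised inside the worker thread, so the results list in the code above also surfaces per-page failures instead of swallowing them. concurrent.futures.as_completed() does the same while yielding each task as it finishes, which makes per-page error reporting easy. A minimal sketch, assuming the same URL pattern (fetch is a hypothetical helper, not from the thread):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(i):
    url = f"http://www.xinfadi.com.cn/newscenter.html?current={i}"
    resp = requests.get(url)
    resp.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    return url

with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(fetch, i): i for i in range(1, 16)}
    for fut in as_completed(futures):  # yields each Future as it finishes
        page = futures[fut]
        err = fut.exception()  # non-None if fetch() raised
        print(f"page {page}:", 'ok' if err is None else f'failed ({err!r})')

With raise_for_status() in the worker, an HTTP error on any page shows up as a reported failure instead of silently producing an incomplete CSV.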
