python - Passing a Selenium response URL to Scrapy
I am learning Python and trying to scrape a page for a specific value in a dropdown menu. After that, I need to click each item in the resulting table to retrieve specific information. I am able to select the item and retrieve the information with the webdriver, but I do not know how to pass the response URL to the CrawlSpider.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from scrapy.http import TextResponse
    import time

    driver = webdriver.Firefox()
    driver.get('http://www.cppcc.gov.cn/cms/icms/project1/cppcc/wylibary/wjweiyuanlist.jsp')

    more_btn = WebDriverWait(driver, 20).until(
        EC.visibility_of_element_located((By.ID, '_button_select'))
    )
    more_btn.click()

    ## select the specific values from the dropdowns
    driver.find_element_by_css_selector("select#tabjcwyxt_jiebie > option[value='teyaoxgrs']").click()
    driver.find_element_by_css_selector("select#tabjcwyxt_jieci > option[value='d11jie']").click()
    search2 = driver.find_element_by_class_name('input_a2')
    search2.click()
    time.sleep(5)

    ## convert the html to a "nice format"
    text_html = driver.page_source.encode('utf-8')
    html_str = str(text_html)

    ## this is a hack that initiates a "TextResponse" object (taken from the scrapy module)
    resp_for_scrapy = TextResponse('none', 200, {}, html_str, [], None)
So here I am stuck. I am able to query the page using the code above, but how can I pass resp_for_scrapy to the CrawlSpider? Putting resp_for_scrapy in place of item didn't work.
    ## spider
    class ProfileSpider(CrawlSpider):
        name = 'pccprofile2'
        allowed_domains = ['cppcc.gov.cn']
        start_urls = ['http://www.cppcc.gov.cn/cms/icms/project1/cppcc/wylibary/wjweiyuanlist.jsp']

        def parse(self, resp_for_scrapy):
            hxs = HtmlXPathSelector(resp_for_scrapy)
            items = []
            for post in resp_for_scrapy.xpath('//div[@class="table"]//ul//li'):
                item = Ppcprofile2Item()
                item["name"] = hxs.select("//h1/text()").extract()
                item["title"] = hxs.select("//div[@id='contentbody']//tr//td//text()").extract()
                items.append(item)

            ## click next page
            while True:
                next = self.driver.find_element(By.LINK_TEXT, "下一页")
                try:
                    next.click()
                except:
                    break

            return items
Any suggestions would be greatly appreciated!
EDIT: I have included a downloader middleware class to perform the dropdown selection before the spider class. There is no error, but also no result.
    class JsMiddleware(object):
        def process_request(self, request, spider):
            driver = webdriver.PhantomJS()
            driver.get('http://www.cppcc.gov.cn/cms/icms/project1/cppcc/wylibary/wjweiyuanlist.jsp')

            # select from the dropdowns
            more_btn = WebDriverWait(driver, 20).until(
                EC.visibility_of_element_located((By.ID, '_button_select'))
            )
            more_btn.click()

            driver.find_element_by_css_selector("select#tabjcwyxt_jiebie > option[value='teyaoxgrs']").click()
            driver.find_element_by_css_selector("select#tabjcwyxt_jieci > option[value='d11jie']").click()
            search2 = driver.find_element_by_class_name('input_a2')
            search2.click()
            time.sleep(5)

            # get the response
            body = driver.page_source
            return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)


    class ProfileSpider(CrawlSpider):
        name = 'pccprofile2'
        rules = [Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=("//div[@class='table']")),
                      callback='parse_item')]

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            items = []
            item = Ppcprofile2Item()
            item["name"] = hxs.select("//h1/text()").extract()
            item["title"] = hxs.select("//div[@id='contentbody']//tr//td//text()").extract()
            items.append(item)

            ## click next page
            while True:
                next = response.find_element(By.LINK_TEXT, "下一页")
                try:
                    next.click()
                except:
                    break

            return items
Use a downloader middleware to catch Selenium-required pages before you process them regularly with Scrapy:
The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses.
Here's a very basic example using PhantomJS:
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JsMiddleware(object):
        def process_request(self, request, spider):
            driver = webdriver.PhantomJS()
            driver.get(request.url)

            body = driver.page_source
            return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
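For the middleware to run at all, it has to be enabled in the project's settings. A minimal sketch, assuming the project and module names below (replace them with wherever your middleware class actually lives):

    # settings.py
    # 'myproject.middlewares' is a placeholder path; the priority value 543 is arbitrary.
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.JsMiddleware': 543,
    }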
Once you return that HtmlResponse (or a TextResponse if that's what you really want), Scrapy will cease processing downloaders and drop into the spider's parse method:
If it returns a Response object, Scrapy won't bother calling any other process_request() or process_exception() methods, or the appropriate download function; it'll return that response. The process_response() methods of installed middleware are always called on every response.
In your case, you can continue to use your spider's parse method on the HTML as usual, except that the JS on the page has already been executed.
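As a rough sketch of what that looks like (the spider name and XPaths here are illustrative placeholders, not taken from the question), the spider side is just an ordinary Scrapy spider working on the rendered markup:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://www.cppcc.gov.cn/cms/icms/project1/cppcc/wylibary/wjweiyuanlist.jsp']

        def parse(self, response):
            # 'response' is the HtmlResponse built by the middleware,
            # so the JS-rendered table rows are already in the body.
            for row in response.xpath('//div[@class="table"]//ul//li'):
                yield {'text': row.xpath('.//text()').extract()}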
Tip: Since the downloader middleware's process_request method accepts the spider as an argument, you can add a conditional there to check whether the spider needs JS processing at all, which lets you handle both JS and non-JS pages with the exact same spider class.
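A minimal sketch of that idea, assuming you mark JS-heavy spiders with a custom attribute (the 'uses_js' name is made up for illustration):

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JsMiddleware(object):
        def process_request(self, request, spider):
            # 'uses_js' is a hypothetical flag you would set on your spider class.
            # Returning None tells Scrapy to keep handling the request normally,
            # so non-JS pages go through the standard downloader.
            if not getattr(spider, 'uses_js', False):
                return None

            driver = webdriver.PhantomJS()
            driver.get(request.url)
            body = driver.page_source
            return HtmlResponse(driver.current_url, body=body,
                                encoding='utf-8', request=request)

A spider that needs rendering would then simply declare uses_js = True, while all other spiders are left untouched.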