python - passing selenium response url to scrapy


I am learning Python and trying to scrape a page for a specific value in a dropdown menu. After that, I need to click each item in the resulting table to retrieve the specific information. I am able to select the item and retrieve the information with the webdriver, but I don't know how to pass the response URL to the CrawlSpider.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import TextResponse
import time

driver = webdriver.Firefox()
driver.get('http://www.cppcc.gov.cn/cms/icms/project1/cppcc/wylibary/wjweiyuanlist.jsp')

more_btn = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.ID, '_button_select'))
)
more_btn.click()

## select the specific values from the dropdowns
driver.find_element_by_css_selector("select#tabjcwyxt_jiebie > option[value='teyaoxgrs']").click()
driver.find_element_by_css_selector("select#tabjcwyxt_jieci > option[value='d11jie']").click()
search2 = driver.find_element_by_class_name('input_a2')
search2.click()
time.sleep(5)

## convert the html into a "nice format"
text_html = driver.page_source.encode('utf-8')
html_str = str(text_html)

## this is a hack that initiates a "TextResponse" object (taken from the Scrapy module)
resp_for_scrapy = TextResponse('none', 200, {}, html_str, [], None)
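(For reference, TextResponse from scrapy.http also accepts keyword arguments, so the hack above can be written without the manual encode()/str() round-trip. This is only a sketch of that construction, assuming the same driver object as in the code above:)

from scrapy.http import TextResponse

# Sketch only: build the response directly from the rendered page.
# TextResponse handles the encoding itself when one is supplied.
resp_for_scrapy = TextResponse(
    url=driver.current_url,      # the URL Selenium ended up on
    body=driver.page_source,     # rendered HTML after the JS has run
    encoding='utf-8',
)

# XPath selectors then work as on any Scrapy response.
rows = resp_for_scrapy.xpath('//div[@class="table"]//ul//li')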

So here is where I am stuck. I am able to query using the above code, but how can I pass resp_for_scrapy to the CrawlSpider? I put resp_for_scrapy in place of item, but that didn't work.

## spider
class ProfileSpider(CrawlSpider):
    name = 'pccprofile2'
    allowed_domains = ['cppcc.gov.cn']
    start_urls = ['http://www.cppcc.gov.cn/cms/icms/project1/cppcc/wylibary/wjweiyuanlist.jsp']

    def parse(self, resp_for_scrapy):
        hxs = HtmlXPathSelector(resp_for_scrapy)
        items = []
        for post in resp_for_scrapy.xpath('//div[@class="table"]//ul//li'):
            item = Ppcprofile2Item()
            item["name"] = hxs.select("//h1/text()").extract()
            item["title"] = hxs.select("//div[@id='contentbody']//tr//td//text()").extract()
            items.append(item)

        ## click next page ("下一页" means "next page")
        while True:
            next = self.driver.find_element_by_link_text("下一页")
            try:
                next.click()
            except:
                break

        return items

Any suggestions would be appreciated!

Edit: I included a middleware class to select the dropdown values before the spider class. There is no error, but there is also no result.

class JsMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        driver.get('http://www.cppcc.gov.cn/cms/icms/project1/cppcc/wylibary/wjweiyuanlist.jsp')

        # select from the dropdowns
        more_btn = WebDriverWait(driver, 20).until(
            EC.visibility_of_element_located((By.ID, '_button_select'))
        )
        more_btn.click()

        driver.find_element_by_css_selector("select#tabjcwyxt_jiebie > option[value='teyaoxgrs']").click()
        driver.find_element_by_css_selector("select#tabjcwyxt_jieci > option[value='d11jie']").click()
        search2 = driver.find_element_by_class_name('input_a2')
        search2.click()
        time.sleep(5)

        # get the response
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)


class ProfileSpider(CrawlSpider):
    name = 'pccprofile2'
    rules = [Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=("//div[@class='table']")), callback='parse_item')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Ppcprofile2Item()
        item["name"] = hxs.select("//h1/text()").extract()
        item["title"] = hxs.select("//div[@id='contentbody']//tr//td//text()").extract()
        items.append(item)

        # click next page
        while True:
            next = response.findElement(By.linkText("下一页"))
            try:
                next.click()
            except:
                break

        return items

Use a downloader middleware to catch pages that require Selenium before you process them regularly with Scrapy:

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses.

Here's a very basic example using PhantomJS:

from scrapy.http import HtmlResponse
from selenium import webdriver

class JsMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        driver.get(request.url)

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
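Note that Scrapy only calls a downloader middleware that has been activated in the project settings. A minimal sketch of that registration, assuming the class above lives at myproject/middlewares.py (the dotted path and the priority number are placeholders for your own project layout):

# settings.py -- activate the custom downloader middleware.
# 'myproject.middlewares.JsMiddleware' is a placeholder dotted path;
# point it at wherever the class is actually defined. 543 is an
# arbitrary middleware priority.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.JsMiddleware': 543,
}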

Once you return that HtmlResponse (or a TextResponse, if that's what you really want), Scrapy will cease processing downloaders and drop into the spider's parse method:

If it returns a Response object, Scrapy won't bother calling any other process_request() or process_exception() methods, or the appropriate download function; it'll return that response. The process_response() methods of installed middleware are always called on every response.

In this case, you can continue to use your spider's parse method as you normally would with HTML, except that the JS on the page has already been executed.
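For illustration, here is a minimal sketch of such a parse method, using a plain scrapy.Spider and the XPaths and item class (Ppcprofile2Item) from the question; the relative XPaths inside the loop are assumptions about the page layout, not verified against the site:

import scrapy
# assumes: from ..items import Ppcprofile2Item (import path depends on your project)

class ProfileSpider(scrapy.Spider):
    name = 'pccprofile2'
    start_urls = ['http://www.cppcc.gov.cn/cms/icms/project1/cppcc/wylibary/wjweiyuanlist.jsp']

    def parse(self, response):
        # 'response' already holds the JS-rendered HTML produced by the middleware.
        for row in response.xpath('//div[@class="table"]//ul//li'):
            item = Ppcprofile2Item()
            item['name'] = row.xpath('.//h1/text()').extract()
            item['title'] = row.xpath('.//td//text()').extract()
            yield item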

Tip: Since the downloader middleware's process_request method accepts the spider as an argument, you can add a conditional in the spider to check whether you need to process JS at all, and that will let you handle both JS and non-JS pages with the exact same spider class.
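A minimal sketch of that conditional, assuming a hypothetical needs_js attribute set on the spider (the flag name is made up for illustration):

from scrapy.http import HtmlResponse
from selenium import webdriver

class JsMiddleware(object):
    def process_request(self, request, spider):
        # Only render with Selenium when the spider opts in; returning None
        # tells Scrapy to download the request normally.
        if not getattr(spider, 'needs_js', False):
            return None

        driver = webdriver.PhantomJS()
        driver.get(request.url)
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)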

