I got the proxy list with proxybroker.
I want to change each entry from the format
<Proxy US 0.00s [] 104.131.6.78:80>
into 104.131.6.78:80
with grep.
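The conversion described above boils down to pulling the `host:port` token out of each line. With grep, something like `grep -oE '[0-9]+(\.[0-9]+){3}:[0-9]+'` should do it; the same idea in Python is sketched below. The function name `extract_host_port` is made up for illustration, and the pattern assumes IPv4 addresses only.

```python
import re

# Matches an IPv4 address followed by a colon and a port, e.g. 104.131.6.78:80.
# "0.00s" in the proxybroker output is not matched because it has only one dot.
HOST_PORT_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d+")


def extract_host_port(line):
    """Return the first host:port token found in line, or None if there is none."""
    match = HOST_PORT_RE.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = "<Proxy US 0.00s [] 104.131.6.78:80>"
    print(extract_host_port(sample))
```

Applied to every line of the proxybroker output, this gives a plain list that can be written to proxy.csv.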
All the proxies are stored in proxy.csv in the following format.
I wrote my crawler following this page:
Multiple Proxies
Here is the skeleton of my spider, test.py.
The following error occurs when I run the spider with
scrapy runspider test.py
Connection was refused by other side: 111: Connection refused.
With the same proxies obtained from proxybroker, I downloaded the URL set in my own way instead of with Scrapy.
To keep it simple, broken proxy IPs are left in the list instead of being removed.
The following snippet only tests whether the proxy IPs can be used; it does not try to download the whole URL set perfectly.
The program structure is as follows.
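A minimal sketch of such a proxy check, using only the standard library, could look like the following. It is not the original snippet: the function name `proxy_works`, the default test URL, and the timeout are assumptions chosen for illustration.

```python
import http.client
import urllib.request


def proxy_works(proxy, url="http://example.com", timeout=5.0):
    """Return True if url can be fetched through proxy (given as 'host:port')."""
    handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, HTTP errors and malformed replies
        # all count as "this proxy cannot be used".
        return False


if __name__ == "__main__":
    for candidate in ["104.131.6.78:80"]:
        print(candidate, proxy_works(candidate))
```

A broken proxy simply yields False here, which matches the approach of leaving bad entries in the list and skipping them at request time.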
Many URLs can be downloaded with the proxies grabbed by proxybroker.
It is clear that:
- many of the proxy IPs grabbed by proxybroker can be used; many of them are free and stable.
- there is some bug in my Scrapy code.
How can I fix the bugs in my Scrapy code?
vezunchik
it_is_a_literature
1 Answer
Try using scrapy-proxies.
In your settings.py you can make changes something like this:
Hopefully this will help you, as it solved my problem too.
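For reference, a settings.py fragment along the lines of what the scrapy-proxies README suggests might look like the sketch below. The proxy list path is a placeholder, the retry values are illustrative, and the setting names are given as I remember them from that project, so check them against the version you install. I believe the list file expects one proxy per line with a scheme, for example http://104.131.6.78:80.

```python
# settings.py (sketch; verify names against the scrapy-proxies documentation)

# Retry generously, since free proxies fail often.
RETRY_TIMES = 10
# Retry on most error codes, since proxies fail for different reasons.
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# Placeholder path to the proxy list file.
PROXY_LIST = "/path/to/proxy/list.txt"

# 0 is assumed here to mean "pick a random proxy for every request".
PROXY_MODE = 0
```

With retries enabled, a request that hits a dead proxy and gets "Connection refused" is tried again through another proxy instead of failing the whole crawl.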
Jaffer Wilson