Bypass reCAPTCHA And Prevent IP Blocking Using Tor Proxy
When we run web crawlers, the target site sometimes blocks our IP address or serves a reCAPTCHA challenge, and crawling gets interrupted. Rotating the IP address with each request helps avoid both problems.
This blog is the written version of a video I published on YouTube. If you prefer watching videos to reading blogs, you can watch the video instead.
We will use a Tor proxy to rotate the IP address with each HTTP request. First, let's install the Tor browser. Open the terminal and run the following commands:
sudo add-apt-repository ppa:micahflee/ppa
sudo apt install torbrowser-launcher
Now, you should have a torrc file in your /etc/tor/ directory. If it is missing, installing the Tor service itself (sudo apt install tor) should create it.
Open the torrc file with sudo nano /etc/tor/torrc and uncomment the following lines (they are usually commented out):
ControlPort 9051
HashedControlPassword 16:2D99FRCE35858C6F608DB3122A6C8DA4C35BE5E105B9B54A7E438B122F
CookieAuthentication 1
Your torrc file contains a HashedControlPassword entry. We will replace its value with the hash of a password that you choose.
Open your terminal and create the new hashed password with the following command:
tor --hash-password <password key>
For example,
tor --hash-password mypass
This hashes the key mypass and prints the hashed password (a string starting with 16:) on your terminal. Note both the key and the hashed password. We will use both later.
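Now, replace the HashedControlPassword value in /etc/tor/torrc with the hashed password you just generated. You can use nano or any other editor. Save the torrc file, then restart the Tor service (for example, sudo systemctl restart tor) so the new settings take effect.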
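Before restarting Tor, it can help to confirm that the options really are uncommented. The following is a minimal sketch rather than part of the original tutorial: the helper name missing_torrc_options and the sample text are hypothetical, and it only inspects the text you pass in.

```python
REQUIRED_OPTIONS = ("ControlPort", "HashedControlPassword")


def missing_torrc_options(torrc_text, required=REQUIRED_OPTIONS):
    """Return the required option names that have no active (uncommented) line."""
    active = set()
    for line in torrc_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments; torrc comments start with '#'.
        if not stripped or stripped.startswith("#"):
            continue
        # The option name is the first word; compare case-insensitively.
        active.add(stripped.split()[0].lower())
    return [name for name in required if name.lower() not in active]


# Hypothetical sample: ControlPort is still commented out here.
sample = """
#ControlPort 9051
HashedControlPassword 16:<your hashed password>
CookieAuthentication 1
"""
print(missing_torrc_options(sample))
```

To check the real file, you could pass in the contents of /etc/tor/torrc instead of the sample string. An empty list means both options are active.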
Now, we will use the mypass key to renew the connection with each request. First, install the stem and requests libraries. The [socks] extra also installs PySocks, which requests needs in order to use socks5:// proxies.
pip install stem
pip install "requests[socks]"
Now, create a new Python file and use the following code to change your IP address:
from stem import Signal
from stem.control import Controller
import requests


def get_tor_session():
    # initialize a requests Session
    session = requests.Session()
    # this requires a running Tor service on your machine, listening on port 9050 (the default)
    session.proxies = {
        "http": "socks5://127.0.0.1:9050",
        "https": "socks5://127.0.0.1:9050",
    }
    return session


def renew_connection():
    # ask Tor for a new circuit through the control port enabled in torrc
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password="mypass")
        controller.signal(Signal.NEWNYM)
Notice how we used the mypass key in the renew_connection function.
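Tor rate-limits NEWNYM signals, so a signal sent too soon after the previous one may not produce a new circuit. Here is a minimal sketch of waiting out that limit first, assuming stem's Controller methods is_newnym_available() and get_newnym_wait(). The helper name renew_when_allowed and the FakeController used for the dry run are hypothetical.

```python
import time


def renew_when_allowed(controller, newnym_signal, sleep=time.sleep):
    """Send NEWNYM, first waiting out Tor's rate limit if one is active.

    Returns the number of seconds waited before sending the signal.
    """
    waited = 0.0
    if not controller.is_newnym_available():
        waited = controller.get_newnym_wait()
        sleep(waited)
    controller.signal(newnym_signal)
    return waited


class FakeController:
    """Hypothetical stand-in so the sketch runs without a Tor service."""

    def __init__(self, wait):
        self.wait = wait
        self.signals = []

    def is_newnym_available(self):
        return self.wait == 0

    def get_newnym_wait(self):
        return self.wait

    def signal(self, sig):
        self.signals.append(sig)


# Dry run: record the requested sleep instead of actually sleeping.
fake = FakeController(wait=3.0)
requested_sleeps = []
print(renew_when_allowed(fake, "NEWNYM", sleep=requested_sleeps.append), fake.signals)
```

With a real connection, the call would be renew_when_allowed(controller, Signal.NEWNYM), placed inside the with Controller.from_port(port=9051) block after authenticate().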
Now, let's use the Tor session to send HTTP requests to some URLs.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.73.11 (KHTML, like Gecko) Version/7.0.1 Safari/537.73.11"
}


def send_request(url_list):
    for url in url_list:
        try:
            # renew the connection
            renew_connection()
            # create a new tor session
            session = get_tor_session()
            html_content = session.get(url, headers=headers).text
            print("IP rotated to:",
                  session.get("https://ident.me", headers=headers).text)
        except Exception as e:
            # report the failure and continue with the next URL
            print(e)


if __name__ == "__main__":
    # IP address before IP rotation
    print("Your Public IP:", requests.get("https://ident.me").text)
    urls = [
        "https://www.google.com",
        "https://www.facebook.com",
        "https://www.youtube.com",
        "https://www.amazon.com",
    ] * 10

    send_request(urls)
We are using the https://ident.me site to print the IP address after each request. You should see a different IP address in most of the printed lines, though consecutive requests can occasionally share an exit IP.
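Since a renewed connection does not guarantee a different exit IP, one option is to renew again until the reported address changes. This is a minimal sketch under that assumption: the helper name rotate_until_changed is hypothetical, and the dry run uses canned documentation addresses instead of a live Tor session.

```python
def rotate_until_changed(get_ip, renew, previous_ip, max_attempts=5):
    """Call renew() until get_ip() differs from previous_ip.

    Returns the new IP, or None if it never changed within max_attempts.
    """
    for _ in range(max_attempts):
        renew()
        current_ip = get_ip()
        if current_ip != previous_ip:
            return current_ip
    return None


# Dry run with canned values: the first lookup repeats the old address.
canned_ips = iter(["192.0.2.1", "198.51.100.7"])
print(rotate_until_changed(lambda: next(canned_ips), lambda: None, "192.0.2.1"))
```

In the script above, renew could be renew_connection, and get_ip could be a small function that fetches https://ident.me through a fresh get_tor_session().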
Renewing the connection before every request can make the program slow, so it is better to use multiprocessing or multithreading to speed it up. With multiprocessing, it looks like this:
# add this to the same file as the code above (it reuses requests and send_request)
from multiprocessing import Pool

if __name__ == "__main__":
    # IP address before IP rotation
    print("Your Public IP:", requests.get("https://ident.me").text)

    urls = [
        "https://www.google.com",
        "https://www.facebook.com",
        "https://www.youtube.com",
        "https://www.amazon.com",
    ] * 10

    # send requests in parallel using multiprocessing, one chunk of 10 URLs per task
    with Pool(processes=20) as pool:
        pool.map(send_request, [urls[i : i + 10] for i in range(0, len(urls), 10)])
        pool.close()
        pool.join()
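The multithreading route can be sketched in a similar way with the standard library's ThreadPoolExecutor. This is an illustrative sketch, not the code from the repository: the helper names chunked and run_in_threads are hypothetical, and the dry run passes len in place of send_request so that nothing is fetched.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def run_in_threads(worker, url_list, chunk_size=10, max_workers=4):
    """Run worker(chunk) for each chunk of URLs on a thread pool."""
    chunks = chunked(url_list, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input chunks
        return list(executor.map(worker, chunks))


# Dry run: len stands in for send_request, so each result is a chunk size.
demo_urls = ["https://www.google.com", "https://www.facebook.com"] * 10
print(run_in_threads(len, demo_urls, chunk_size=8))
```

In the script above, the call would be run_in_threads(send_request, urls). Keep in mind that all workers talk to the same Tor service, so a renewal triggered by one worker likely affects the circuits used by the others too.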
The code can be found in this GitHub repository: https://github.com/sksoumik/rotate_IP
Thanks for the read.