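Introduction to Python Proxy Servers

This tutorial covers:

What a Python proxy server is and how it works.
The steps required to build an HTTP proxy server in Python.
The pros and cons of this approach.

Let's dive in!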

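What Is a Python Proxy Server?

A Python proxy server is a Python application that acts as an intermediary between clients and the internet. It intercepts requests from clients, forwards them to the target servers, and sends the responses back to the clients. As a result, the target server sees the proxy's address rather than the client's.

Read our article to learn more about what a proxy server is and how it works.

Python's socket programming support makes it straightforward to implement a basic proxy server that lets you inspect, modify, or redirect network traffic. Proxy servers are well suited to caching, improving performance, and strengthening security, particularly in web scraping.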

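How to Implement an HTTP Proxy Server in Python

Follow the steps below to build a Python proxy server script.

Step 1: Initialize the Python Project

Before you start, make sure Python 3.6 or later is installed on your machine, since the script uses f-strings. If it is not, download the installer, run it, and follow the installation wizard.

Next, run the following commands to create a folder named "python-http-proxy-server" and initialize a Python project with a virtual environment inside it:

mkdir python-http-proxy-server
cd python-http-proxy-server
python -m venv env

Open the "python-http-proxy-server" folder in your Python IDE and create an empty file named "proxy_server.py".

Great! You have completed the initial setup for building an HTTP proxy server in Python.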

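Step 2: Initialize the Incoming Socket

First, you need a TCP socket server that accepts incoming requests. If you are not familiar with the concept, a socket is a low-level programming abstraction that provides a bidirectional data stream between a client and a server. In the context of a web server, a server socket listens for incoming connections from clients.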
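To make the idea of a bidirectional stream concrete, here is a minimal sketch. It assumes nothing beyond the standard library; demo_bidirectional_stream() is a hypothetical name used only for illustration, and socket.socketpair() stands in for a real client and server so that no network setup is needed.

```python
import socket


def demo_bidirectional_stream():
    # socketpair() returns two sockets that are already connected to each other
    client_end, server_end = socket.socketpair()
    with client_end, server_end:
        client_end.sendall(b"ping")        # data flows client -> server
        request = server_end.recv(1024)
        server_end.sendall(b"pong")        # and server -> client on the same connection
        reply = client_end.recv(1024)
    return request, reply


print(demo_bidirectional_stream())
```

Both ends can send and receive on the same connection, which is what lets a proxy read a request from a socket and later write the response back to it.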

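You can create a socket-based server in Python as follows:

port = 8888
# create a TCP socket for the proxy server
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# bind the proxy server to a specific address and port
server.bind(('127.0.0.1', port))
# queue up to 10 pending connections
server.listen(10)

This creates the incoming server socket and binds it to the local address 127.0.0.1 on port 8888. The listen() call then puts the socket into listening mode so that the server can accept connections. Its argument is the backlog, that is, the number of pending connections the operating system will queue.

Note: You are free to change the port the proxy listens on. You could also modify the script to read it from the command line for maximum flexibility.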
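As a rough sketch of the command-line idea in the note, the standard argparse module can supply the address and port. The option names --host and --port and the parse_args() helper are assumptions made for this example, not part of the tutorial's script.

```python
import argparse


def parse_args(argv=None):
    # hypothetical options; the defaults mirror the values hard-coded in the tutorial
    parser = argparse.ArgumentParser(description="Minimal HTTP proxy server")
    parser.add_argument("--host", default="127.0.0.1", help="address to bind to")
    parser.add_argument("--port", type=int, default=8888, help="port to listen on")
    return parser.parse_args(argv)


# passing an explicit list instead of reading sys.argv keeps the example reproducible
args = parse_args(["--port", "9000"])
print(args.host, args.port)
```

In the real script you would call parse_args() with no arguments and pass args.host and args.port to bind().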

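socket is a module in the Python standard library, so add the following import at the top of the script:

import socket

To confirm that the Python proxy server has started as expected, log the following message:

print(f"Proxy server listening on port {port}...")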
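The pieces of this step can be gathered into one helper. The sketch below is one possible arrangement, not the tutorial's own code: create_listening_socket() is a hypothetical name, and the SO_REUSEADDR option is an optional addition that lets you restart the script quickly without an "Address already in use" error.

```python
import socket


def create_listening_socket(host="127.0.0.1", port=8888, backlog=10):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # optional addition (not in the tutorial): allow rebinding right after a restart
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(backlog)
    return server


# port 0 asks the operating system for any free port, which is handy for a quick check
server = create_listening_socket(port=0)
print(f"Proxy server listening on port {server.getsockname()[1]}...")
server.close()
```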

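Step 3: Accept Client Requests

When a client connects to the proxy server, accept() returns a new socket dedicated to communication with that specific client. Here is how to do that in Python:

# listen for incoming requests
while True:
    client_socket, addr = server.accept()
    print(f"Accepted connection from {addr[0]}:{addr[1]}")
    # create a thread to handle the client request
    client_handler = threading.Thread(target=handle_client_request, args=(client_socket,))
    client_handler.start()

To handle several client requests at the same time, use multithreading as shown above. Do not forget to import the threading module from the Python standard library:

import threading

As you can see, the proxy server handles incoming requests through the custom handle_client_request() function. The next steps define it.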
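To see the thread-per-connection pattern in isolation, here is a small self-contained sketch. It is an illustration under assumptions: the handler below is a stand-in that echoes data in upper case rather than proxying it, the accept loop is bounded so the demo terminates, and accept_clients() and run_demo() are hypothetical names.

```python
import socket
import threading


def handle_client_request(client_socket):
    # stand-in handler for the demo: echo the data back in upper case
    with client_socket:
        data = client_socket.recv(1024)
        client_socket.sendall(data.upper())


def accept_clients(server, max_clients):
    # same accept-and-spawn pattern as the proxy loop, but bounded so the demo ends
    handlers = []
    for _ in range(max_clients):
        client_socket, addr = server.accept()
        handler = threading.Thread(target=handle_client_request, args=(client_socket,))
        handler.start()
        handlers.append(handler)
    for handler in handlers:
        handler.join()


def run_demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(10)
    port = server.getsockname()[1]
    acceptor = threading.Thread(target=accept_clients, args=(server, 2))
    acceptor.start()
    replies = []
    for message in (b"first", b"second"):
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(message)
            replies.append(client.recv(1024))
    acceptor.join()
    server.close()
    return replies


print(run_demo())
```

Each accepted connection gets its own thread, so a slow client does not prevent the server from accepting the next one.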

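Step 4: Process Incoming Requests

Once the client socket has been created, use it to:

Read the data of the incoming request.
Extract the target server's host and port from that data.
Forward the client request to the target server.
Receive the response and forward it to the original client.

This section covers the first two operations. Define the handle_client_request() function and use it to read the data of the incoming request: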

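def handle_client_request(client_socket):
    print("Received request:\n")
    # read the data sent by the client in the request
    request = b''
    client_socket.setblocking(False)
    while True:
        try:
            # receive the next chunk from the client
            data = client_socket.recv(1024)
        except OSError:
            # no data available right now (or the connection failed): stop reading
            break
        if not data:
            # the client closed the connection
            break
        request = request + data
        # log the chunk that was just read
        print(data.decode('utf-8', errors='replace'))

Calling setblocking(False) puts the client socket in non-blocking mode. recv() then reads the incoming data, which is appended to the request as bytes. Since the size of the incoming request is not known in advance, it has to be read one chunk at a time, and here a chunk is at most 1024 bytes. In non-blocking mode, recv() raises a BlockingIOError when no data is available, so the except clause marks the end of the read. Note that this is a simplification: data that has not arrived yet looks the same as the end of the request.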
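The behavior of a non-blocking recv() is easier to see on a connected pair of local sockets. The following is a minimal sketch under assumptions: read_available() and run_demo() are hypothetical names, and socket.socketpair() stands in for the client connection.

```python
import socket


def read_available(sock, chunk_size=1024):
    # read whatever is currently buffered, one chunk at a time
    sock.setblocking(False)
    received = b""
    while True:
        try:
            data = sock.recv(chunk_size)
        except BlockingIOError:
            # nothing is waiting in the socket buffer right now
            break
        if not data:
            # the peer closed the connection
            break
        received += data
    return received


def run_demo():
    sender, receiver = socket.socketpair()
    with sender, receiver:
        sender.sendall(b"x" * 3000)           # larger than one 1024-byte chunk
        first = read_available(receiver)      # collects the buffered data in several chunks
        second = read_available(receiver)     # nothing new was sent, so this is empty
    return len(first), len(second)


print(run_demo())
```

The second call returns an empty result even though the connection is still open, which illustrates why this technique cannot tell "no more data" apart from "no data yet".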

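Note the logged messages, which let you track what the Python proxy server is doing.

After retrieving the incoming request, extract the target server's host and port from it: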

host, port = extract_host_port_from_request(request)

In particular, this is what the extract_host_port_from_request() function looks like:

def extract_host_port_from_request(request):
    # get the value after the "Host:" string
    host_string_start = request.find(b'Host: ') + len(b'Host: ')
    host_string_end = request.find(b'\r\n', host_string_start)
    host_string = request[host_string_start:host_string_end].decode('utf-8')
    webserver_pos = host_string.find("/")
    if webserver_pos == -1:
        webserver_pos = len(host_string)
    # if there is a specific port
    port_pos = host_string.find(":")
    # no port specified
    if port_pos == -1 or webserver_pos < port_pos:
        # default port
        port = 80
        host = host_string[:webserver_pos]
    else:
        # extract the specific port from the host string
        port = int((host_string[(port_pos + 1):])[:webserver_pos - port_pos - 1])
        host = host_string[:port_pos]
    return host, port

To better understand what it does, consider the example below. This is what the encoded string of an incoming request usually contains:

GET http://example.com/your-page HTTP/1.1
Host: example.com
User-Agent: curl/8.4.0
Accept: */*
Proxy-Connection: Keep-Alive

extract_host_port_from_request() extracts the web server's host and port from the "Host:" field. In this example, the host is example.com and the port is 80, the default, because the header does not specify one.
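For comparison, here is an alternative sketch of the same extraction that walks the header lines instead of searching for the literal "Host: " string. It is an illustration, not the tutorial's function: parse_host_header() is a hypothetical name, it matches the header name case-insensitively, and it does not try to handle IPv6 literals.

```python
def parse_host_header(request, default_port=80):
    # skip the request line, then scan header lines until the blank line
    for line in request.split(b"\r\n")[1:]:
        if not line:
            break
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"host":
            host_value = value.strip().decode("ascii")
            host, separator, port = host_value.rpartition(":")
            if separator and port.isdigit():
                return host, int(port)
            return host_value, default_port
    raise ValueError("request has no Host header")


request = (
    b"GET http://example.com/your-page HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"User-Agent: curl/8.4.0\r\n"
    b"\r\n"
)
print(parse_host_header(request))
print(parse_host_header(b"GET / HTTP/1.1\r\nhost: example.com:8080\r\n\r\n"))
```

The first call falls back to the default port 80, while the second picks up the explicit port 8080.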

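Step 5: Forward the Client Request and Handle the Response

Now that you have the target host and port, forward the client request to the target server. In handle_client_request(), create a new socket and use it to send the original request to the target server: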

# create a socket to connect to the original destination server
destination_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connect to the destination server
destination_socket.connect((host, port))
# send the original request
destination_socket.sendall(request)

Then, get ready to receive the server response and propagate it to the original client:

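# switch the client socket back to blocking mode so that sendall() can complete
client_socket.setblocking(True)
# read the data received from the server
# one chunk at a time and send it to the client
print("Received response:\n")
while True:
    # receive data from the destination server
    data = destination_socket.recv(1024)
    # log the chunk that was just received
    print(data.decode('utf-8', errors='replace'))
    if len(data) > 0:
        # send back to the client
        client_socket.sendall(data)
    else:
        # no more data to send
        break

Again, since the size of the response is not known in advance, it is handled one chunk at a time. When recv() returns empty data, the server has closed the connection and there is nothing more to receive, so the loop ends.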
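The chunked forwarding loop can be tried without any real web server. The sketch below makes some assumptions: relay() and run_demo() are hypothetical names, and two socket.socketpair() connections play the roles of the destination-to-proxy and proxy-to-client links.

```python
import socket


def relay(source, destination, chunk_size=1024):
    # copy data from source to destination until source reports end of stream
    total = 0
    while True:
        data = source.recv(chunk_size)
        if not data:
            break
        destination.sendall(data)
        total += len(data)
    return total


def run_demo():
    server_end, proxy_upstream = socket.socketpair()      # destination <-> proxy
    proxy_downstream, client_end = socket.socketpair()    # proxy <-> client
    payload = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
    server_end.sendall(payload)
    server_end.close()                 # closing makes recv() return b"" after the data
    forwarded = relay(proxy_upstream, proxy_downstream)
    received = client_end.recv(4096)
    for sock in (proxy_upstream, proxy_downstream, client_end):
        sock.close()
    return forwarded, received == payload


print(run_demo())
```

The loop only ends because the sending side closes its socket. A destination that keeps the connection open leaves recv() waiting, which is worth keeping in mind with keep-alive connections.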

Do not forget to close the two sockets used in the function:

# close the sockets
destination_socket.close()
client_socket.close()
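Explicit close() calls are skipped if an exception is raised earlier in the function. One way to guard against that, sketched here under assumptions, is to use sockets as context managers. send_and_collect() and run_demo() are hypothetical names, and a throwaway local server stands in for the destination.

```python
import socket
import threading


def send_and_collect(host, port, request, timeout=5.0):
    # the with block closes the socket even if sendall() or recv() raises
    with socket.create_connection((host, port), timeout=timeout) as destination_socket:
        destination_socket.sendall(request)
        chunks = []
        while True:
            data = destination_socket.recv(1024)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def run_demo():
    # throwaway local server standing in for the destination
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    def serve_once():
        connection, _ = server.accept()
        with connection:
            connection.recv(1024)
            connection.sendall(b"pong")

    worker = threading.Thread(target=serve_once)
    worker.start()
    reply = send_and_collect("127.0.0.1", server.getsockname()[1], b"ping")
    worker.join()
    server.close()
    return reply


print(run_demo())
```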

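Great! You have created an HTTP proxy server in Python. It is time to look at the full code, launch it, and verify that it works as expected.

Step 6: Put It All Together

This is the final code of the Python proxy server script: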

import socket
import threading

def handle_client_request(client_socket):
    print("Received request:\n")
    # read the data sent by the client in the request
    request = b''
    client_socket.setblocking(False)
    while True:
        try:
            # receive the next chunk from the client
            data = client_socket.recv(1024)
        except OSError:
            # no data available right now (or the connection failed): stop reading
            break
        if not data:
            # the client closed the connection
            break
        request = request + data
        # log the chunk that was just read
        print(data.decode('utf-8', errors='replace'))
    # switch back to blocking mode so that sendall() can complete later on
    client_socket.setblocking(True)
    # extract the webserver's host and port from the request
    host, port = extract_host_port_from_request(request)
    # create a socket to connect to the original destination server
    destination_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # connect to the destination server
    destination_socket.connect((host, port))
    # send the original request
    destination_socket.sendall(request)
    # read the data received from the server
    # one chunk at a time and send it to the client
    print("Received response:\n")
    while True:
        # receive data from the destination server
        data = destination_socket.recv(1024)
        # log the chunk that was just received
        print(data.decode('utf-8', errors='replace'))
        if len(data) > 0:
            # send back to the client
            client_socket.sendall(data)
        else:
            # no more data to send
            break
    # close the sockets
    destination_socket.close()
    client_socket.close()

def extract_host_port_from_request(request):
    # get the value after the "Host:" string
    host_string_start = request.find(b'Host: ') + len(b'Host: ')
    host_string_end = request.find(b'\r\n', host_string_start)
    host_string = request[host_string_start:host_string_end].decode('utf-8')
    webserver_pos = host_string.find("/")
    if webserver_pos == -1:
        webserver_pos = len(host_string)
    # if there is a specific port
    port_pos = host_string.find(":")
    # no port specified
    if port_pos == -1 or webserver_pos < port_pos:
        # default port
        port = 80
        host = host_string[:webserver_pos]
    else:
        # extract the specific port from the host string
        port = int((host_string[(port_pos + 1):])[:webserver_pos - port_pos - 1])
        host = host_string[:port_pos]
    return host, port

def start_proxy_server():
    port = 8888
    # create a TCP socket and bind the proxy server to a specific address and port
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(('127.0.0.1', port))
    # queue up to 10 pending connections
    server.listen(10)
    print(f"Proxy server listening on port {port}...")
    # listen for incoming requests
    while True:
        client_socket, addr = server.accept()
        print(f"Accepted connection from {addr[0]}:{addr[1]}")
        # create a thread to handle the client request
        client_handler = threading.Thread(target=handle_client_request, args=(client_socket,))
        client_handler.start()

if __name__ == "__main__":
    start_proxy_server()

Launch it with this command:

python proxy_server.py

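You should see the following message in the terminal:

Proxy server listening on port 8888...

To make sure the server works, run a proxied request with cURL. See our guide to learn more about using cURL with a proxy.

Open a new terminal and run:

curl --proxy "http://127.0.0.1:8888" "http://httpbin.org/ip"

This makes a GET request to the destination http://httpbin.org/ip through the proxy server at http://127.0.0.1:8888.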
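The same check can be made from Python. The sketch below uses only the standard library's urllib.request. build_proxy_opener() and fetch_through_proxy() are hypothetical helper names, and the actual request is left commented out because it needs the proxy to be running and network access.

```python
import urllib.request


def build_proxy_opener(proxy_url="http://127.0.0.1:8888"):
    # route plain-HTTP requests through the local proxy
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(proxy_handler)


def fetch_through_proxy(url, proxy_url="http://127.0.0.1:8888", timeout=10):
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# With the proxy server from this tutorial running, you could call:
# print(fetch_through_proxy("http://httpbin.org/ip"))
```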

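The response looks like this:

{
  "origin": "45.12.80.183"
}

The response shows the IP of the proxy server. Why? Because the /ip endpoint of the HTTPBin project returns the IP the request comes from. If you run the server locally, the "origin" field will match your own IP.

Note: The Python proxy server built here works only with HTTP destinations. Extending it to handle HTTPS connections is considerably trickier.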
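Part of the difficulty is that clients do not send an ordinary GET for HTTPS destinations. They send a CONNECT request naming the target as host:port and expect the proxy to open a raw tunnel to it. As a first step only, the sketch below recognizes such a request; parse_connect_target() is a hypothetical helper, and the tunnelling itself (replying with a 200 status and relaying bytes in both directions) is not shown.

```python
def parse_connect_target(request, default_port=443):
    # the request line looks like: CONNECT example.com:443 HTTP/1.1
    request_line = request.split(b"\r\n", 1)[0].decode("ascii")
    parts = request_line.split(" ")
    if len(parts) != 3 or parts[0].upper() != "CONNECT":
        return None  # not a CONNECT request
    host, separator, port = parts[1].rpartition(":")
    if separator and port.isdigit():
        return host, int(port)
    return parts[1], default_port


print(parse_connect_target(b"CONNECT example.com:443 HTTP/1.1\r\nHost: example.com:443\r\n\r\n"))
print(parse_connect_target(b"GET http://example.com/ HTTP/1.1\r\nHost: example.com\r\n\r\n"))
```

The first call yields the tunnel target, while the second returns None because the request is a regular HTTP GET.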

Next, look at the logs written by the proxy server application. They should contain:

Received request:

GET http://httpbin.org/ip HTTP/1.1
Host: httpbin.org
User-Agent: curl/8.4.0
Accept: */*
Proxy-Connection: Keep-Alive

Received response:

HTTP/1.1 200 OK
Date: Thu, 14 Dec 2023 14:02:08 GMT
Content-Type: application/json
Content-Length: 31
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

{
  "origin": "45.12.80.183"
}

This shows that the proxy server receives the request in the format defined by the HTTP protocol, forwards it to the destination server, logs the response data, and sends the response back to the client. How can you be sure? Because the IP in the "origin" field of the response is the same as the proxy server's IP.

Congratulations! You just learned how to build an HTTP proxy server in Python.

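Pros and Cons of Using a Custom Python Proxy Server

Now that you know how to implement a proxy server in Python, it is time to look at the benefits and limitations of this approach.

Pros:

Full control: A custom Python script like this gives you full control over what the proxy server does. No hidden activity and no data leaks!
Customization: The proxy server can be extended with useful features such as logging and request caching to improve performance.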
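As a rough illustration of the caching idea, a small in-memory store could map a request key to the raw response bytes. Everything in this sketch is an assumption made for the example: the ResponseCache class, the choice of the request line as the key, and the least-recently-used eviction policy. It also ignores HTTP cache-control rules entirely.

```python
from collections import OrderedDict


class ResponseCache:
    """Tiny in-memory LRU cache mapping a request key to raw response bytes."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        # mark the entry as recently used
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, response):
        self._entries[key] = response
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            # evict the least recently used entry
            self._entries.popitem(last=False)


cache = ResponseCache(max_entries=2)
cache.put(b"GET http://example.com/a HTTP/1.1", b"response A")
cache.put(b"GET http://example.com/b HTTP/1.1", b"response B")
cache.get(b"GET http://example.com/a HTTP/1.1")                  # touch A
cache.put(b"GET http://example.com/c HTTP/1.1", b"response C")   # evicts B
print(cache.get(b"GET http://example.com/b HTTP/1.1"))
```

Because the proxy handles each client in its own thread, a real version would also need a threading.Lock around get() and put().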

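Cons:

Infrastructure costs: Setting up a proxy server architecture is not easy and requires a significant investment in hardware or VPS services.
Hard to maintain: You are responsible for maintaining the proxy architecture, especially its scalability and availability. That is a job for an experienced system administrator.
Unreliable: The main problem with this solution is that the exit IP of the proxy server never changes. Anti-bot technologies can therefore block that IP and stop the server from reaching the sites you need. In other words, the proxy will eventually stop working.

These limitations make a custom Python proxy server impractical in production scenarios. The solution? A reliable proxy provider like Bright Data! Create an account, verify your identity, and get a free proxy that you can use in your favorite programming language, for example by integrating it into a Python script with the requests library.

Our large proxy network includes millions of fast, reliable, and secure proxy servers around the world. Find out why we are the best proxy server provider.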

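Conclusion

This guide explained what a proxy server is and how it works in Python. In particular, you learned how to build one from scratch with sockets. You are now well versed in Python proxies. The main problem with this approach is that the static exit IP of the proxy server will eventually be blocked. Bright Data's rotating proxies help you avoid that!

Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its proxy network covers different types of proxies:

Datacenter proxies: over 770,000 datacenter IPs.
Residential proxies: over 72 million residential IPs in more than 195 countries.
ISP proxies: over 700,000 ISP IPs.
Mobile proxies: over 7 million mobile IPs.

This reliable, fast, global proxy network is also the foundation of many web scraping services, which rely on it to retrieve data from a wide range of sites.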
