Web Scraping Articles Through HTTP Proxies with Python
Below is an example Python script that uses the requests library to fetch the content of a web page through an HTTP proxy that requires username and password authentication. First, make sure the requests library is installed; if not, you can install it by running pip install requests.
Python Script Example
import requests
from bs4 import BeautifulSoup

# Proxy server address and port
proxy_host = 'proxy_ip'
proxy_port = 'proxy_port'

# Proxy username and password
proxy_user = 'username'
proxy_pass = 'password'

# Note: both entries use the http:// scheme, because the connection to the
# proxy itself is plain HTTP even when the target URL is HTTPS.
proxies = {
    'http': f'http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}',
    'https': f'http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}',
}

# URL to be scraped
url = 'http://example.com'

try:
    # Make the request through the proxy; time out rather than hang forever
    response = requests.get(url, proxies=proxies, timeout=10)
    response.raise_for_status()  # Raises HTTPError for bad responses (4xx/5xx)

    # Parse the webpage content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Assuming the content is within an <article> tag
    article = soup.find('article')

    # Print the content
    if article:
        print(article.get_text(strip=True))
    else:
        print("No article content found.")
except requests.exceptions.HTTPError as err:
    print(f"HTTP Error: {err}")
except requests.exceptions.RequestException as err:
    print(f"Error occurred: {err}")
This script connects to an HTTP proxy authenticated by username and password, then attempts to fetch content from a specified URL. It uses the BeautifulSoup library to parse HTML and tries to print the content within the <article> tag. You'll need to replace proxy_ip, proxy_port, username, password, and http://example.com with your actual proxy server details and target URL. If the web page's article content is structured differently, you may need to adjust the BeautifulSoup selector accordingly.
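If the target page does not use an <article> tag, a CSS selector can point BeautifulSoup at the actual container. Here is a minimal sketch, using a hypothetical snippet of HTML in which the article body sits in a div with class post-body:

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched page whose article
# body is in <div class="post-body"> rather than an <article> tag.
html = '''
<html><body>
  <div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# select_one accepts any CSS selector, so the same call works for
# ids (#main), classes (.post-body), or nested paths (div > p).
article = soup.select_one('div.post-body')
if article:
    print(article.get_text(separator=' ', strip=True))

Running this prints the combined paragraph text, which is the same result you would get from response.text on a live page with that structure.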
To set up and run the above Python scraping script on a CentOS server, you need to install the necessary software and libraries, configure the script with your proxy details, and execute it. Here are the detailed steps:
Step 1: Install Python
First, ensure that Python is installed on your CentOS server. Most modern CentOS systems come with Python pre-installed, but you can confirm by running:
python --version
or (for Python 3):
python3 --version
If Python is not installed, you can install it using:
sudo yum install python3
Step 2: Install pip
pip is Python's package manager, used to install and manage Python packages. Install pip on CentOS using:
sudo yum install python3-pip
Step 3: Install Necessary Python Libraries
You need to install the requests and beautifulsoup4 libraries. Use pip to install these:
pip3 install requests beautifulsoup4
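To confirm the installation succeeded before writing the scraper, a short standard-library-only check can report which packages are importable. This is just a sketch; note that the import name for the beautifulsoup4 package is bs4:

from importlib.util import find_spec

# Module names to check; beautifulsoup4 installs under the name bs4.
required = ['requests', 'bs4']

missing = [name for name in required if find_spec(name) is None]
if missing:
    print('Missing packages:', ', '.join(missing))
else:
    print('All required packages are installed.')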
Step 4: Create the Script
Use your preferred text editor (such as nano or vim) to create a new Python script file:
nano my_scraper.py
Then, copy and paste the Python script code above into this file. Don't forget to replace the proxy settings and target URL with your own values.
Step 5: Run the Script
Save the file and exit the editor, then run the script from the command line:
python3 my_scraper.py
This will execute the script, making a web page request through the specified HTTP proxy and printing the article content.
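Rather than hard-coding proxy credentials into the script, you can read them from environment variables so they stay out of the source file. The following sketch assumes hypothetical variable names PROXY_USER, PROXY_PASS, PROXY_HOST, and PROXY_PORT, exported in the shell before the script runs:

import os

# Hypothetical environment variable names; export them in the shell
# before running the script, e.g.: export PROXY_USER=alice
proxy_user = os.environ.get('PROXY_USER', 'username')
proxy_pass = os.environ.get('PROXY_PASS', 'password')
proxy_host = os.environ.get('PROXY_HOST', 'proxy_ip')
proxy_port = os.environ.get('PROXY_PORT', '8080')

# Build one proxy URL and reuse it for both http and https traffic.
proxy_url = f'http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}'
proxies = {'http': proxy_url, 'https': proxy_url}
print(proxies)

The resulting proxies dictionary can be passed to requests.get exactly as in the main script.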
Considerations
Ensure that your firewall and proxy settings allow your server to access the external network through the specified ports.
Adjust the proxy authentication and web content extraction parts of the Python script as needed for your requirements.
If you are working within a virtual environment, ensure that the necessary libraries are installed in that environment.
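To check from within Python whether the interpreter you are running belongs to a virtual environment, you can compare sys.prefix to sys.base_prefix; they differ inside a venv:

import sys

# Inside a venv, sys.prefix points at the environment directory,
# while sys.base_prefix still points at the base interpreter.
in_venv = sys.prefix != sys.base_prefix
if in_venv:
    print('Running inside a virtual environment')
else:
    print('Running with the system interpreter')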
By following these steps, you should be able to set up and run a Python scraping script on a CentOS server.