Web Scraping Articles Through HTTP Proxies with Python
Below is an example Python script that uses the requests library to fetch the content of a web page through an HTTP proxy that requires username and password authentication. First, make sure the requests library is installed; if not, you can install it by running pip install requests.
Python Script Example
import requests
from bs4 import BeautifulSoup

# Proxy server address and port
proxy_host = 'proxy_ip'
proxy_port = 'proxy_port'

# Proxy username and password
proxy_user = 'username'
proxy_pass = 'password'

# Note: both entries use the http:// scheme, because the connection to the
# proxy itself is plain HTTP even when the target URL is HTTPS.
proxies = {
    'http': f'http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}',
    'https': f'http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}',
}

# URL to be scraped
url = 'http://example.com'

try:
    # Make the request through the proxy; time out rather than hang forever
    response = requests.get(url, proxies=proxies, timeout=10)
    response.raise_for_status()  # Raises HTTPError for bad responses (4xx/5xx)

    # Parse the webpage content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Assuming the content is within an <article> tag
    article = soup.find('article')

    # Print the content
    if article:
        print(article.get_text(strip=True))
    else:
        print("No article content found.")
except requests.exceptions.HTTPError as err:
    print(f"HTTP Error: {err}")
except requests.exceptions.RequestException as err:
    print(f"Error occurred: {err}")
This script connects to an HTTP proxy authenticated by username and password, then attempts to fetch content from a specified URL. It uses the BeautifulSoup library to parse HTML and tries to print the content within the <article> tag. You'll need to replace proxy_ip, proxy_port, username, password, and http://example.com with your actual proxy server details and target URL. If the web page's article content is structured differently, you may need to adjust the BeautifulSoup selector accordingly.
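If the target page does not use an <article> tag, a CSS selector can point BeautifulSoup at the actual container. Here is a minimal sketch, using a hypothetical snippet of HTML in which the article body sits in a div with class post-body:

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched page whose article
# body is in <div class="post-body"> rather than an <article> tag.
html = '''
<html><body>
  <div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# select_one accepts any CSS selector, so the same call works for
# ids (#main), classes (.post-body), or nested paths (div > p).
article = soup.select_one('div.post-body')
if article:
    print(article.get_text(separator=' ', strip=True))

Running this prints the combined paragraph text, which is the same result you would get from response.text on a live page with that structure.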
To set up and run the above Python scraping script on a CentOS server, you need to install the necessary software and libraries, configure the script with your proxy details, and execute it. Here are the detailed steps:
Step 1: Install Python
First, ensure that Python is installed on your CentOS server. Most modern CentOS systems come with Python pre-installed, but you can confirm by running:
python --version
or (for Python 3):
python3 --version
If Python is not installed, you can install it using:
sudo yum install python3
Step 2: Install pip
pip is Python's package manager, used to install and manage Python packages. Install pip on CentOS using:
sudo yum install python3-pip
Step 3: Install Necessary Python Libraries
You need to install the requests and beautifulsoup4 libraries. Use pip to install these:
pip3 install requests beautifulsoup4
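To confirm the installation succeeded before writing the scraper, a short standard-library-only check can report which packages are importable. This is just a sketch; note that the import name for the beautifulsoup4 package is bs4:

from importlib.util import find_spec

# Module names to check; beautifulsoup4 installs under the name bs4.
required = ['requests', 'bs4']

missing = [name for name in required if find_spec(name) is None]
if missing:
    print('Missing packages:', ', '.join(missing))
else:
    print('All required packages are installed.')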
Step 4: Create the Script
Use your preferred text editor (such as nano or vim) to create a new Python script file:
nano my_scraper.py
Then, copy and paste the Python script code above into this file. Don't forget to replace the proxy settings and target URL with your own values.
Step 5: Run the Script
Save the file and exit the editor, then run the script from the command line:
python3 my_scraper.py
This will execute the script, making a web page request through the specified HTTP proxy and printing the article content.
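Rather than hard-coding proxy credentials into the script, you can read them from environment variables so they stay out of the source file. The following sketch assumes hypothetical variable names PROXY_USER, PROXY_PASS, PROXY_HOST, and PROXY_PORT, exported in the shell before the script runs:

import os

# Hypothetical environment variable names; export them in the shell
# before running the script, e.g.: export PROXY_USER=alice
proxy_user = os.environ.get('PROXY_USER', 'username')
proxy_pass = os.environ.get('PROXY_PASS', 'password')
proxy_host = os.environ.get('PROXY_HOST', 'proxy_ip')
proxy_port = os.environ.get('PROXY_PORT', '8080')

# Build one proxy URL and reuse it for both http and https traffic.
proxy_url = f'http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}'
proxies = {'http': proxy_url, 'https': proxy_url}
print(proxies)

The resulting proxies dictionary can be passed to requests.get exactly as in the main script.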
Considerations
Ensure that your firewall and proxy settings allow your server to access the external network through the specified ports.
Adjust the proxy authentication and web content extraction parts of the Python script as needed for your requirements.
If you are working within a virtual environment, ensure that the necessary libraries are installed in that environment.
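To check from within Python whether the interpreter you are running belongs to a virtual environment, you can compare sys.prefix to sys.base_prefix; they differ inside a venv:

import sys

# Inside a venv, sys.prefix points at the environment directory,
# while sys.base_prefix still points at the base interpreter.
in_venv = sys.prefix != sys.base_prefix
if in_venv:
    print('Running inside a virtual environment')
else:
    print('Running with the system interpreter')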
By following these steps, you should be able to set up and run a Python scraping script on a CentOS server.