arXiv Paper Downloader
This Python program automatically downloads papers from arXiv that match one or more keywords, within a given range of publication year-months, up to a requested number of papers. The script currently supports only the Chrome browser.
Feature Highlights
- Keyword Search: Searches for papers using the keywords you provide; multiple keywords are supported.
- Upload Date Range: Restricts the search to papers uploaded within a year-month range that you specify.
- Automatic Download: The program automatically downloads the search results to a specific local folder.
- Avoid Duplicate Downloads: Records information about downloaded papers in an Excel file for future reference and to prevent duplicate downloads.
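The duplicate check can be pictured with a small sketch. It is only an illustration of the idea, not the program's actual code: the `select_new_papers` helper, the `"id"` key, and the sample IDs are hypothetical, and the record of earlier downloads is held in an in-memory set here rather than in the Excel file the program uses.

```python
def select_new_papers(candidates, downloaded_ids, limit):
    """Return up to `limit` candidates whose IDs are not already recorded."""
    seen = set(downloaded_ids)  # copy, so the caller's record is not mutated
    selected = []
    for paper in candidates:
        if len(selected) >= limit:
            break
        paper_id = paper["id"]
        if paper_id in seen:
            continue  # downloaded earlier, or repeated in the search results
        selected.append(paper)
        seen.add(paper_id)
    return selected


# Made-up identifiers, used only to exercise the function.
search_results = [{"id": "paper-a"}, {"id": "paper-b"}, {"id": "paper-a"}]
already_downloaded = {"paper-b"}
new_papers = select_new_papers(search_results, already_downloaded, limit=5)
```

After a successful download, the IDs of the selected papers would be appended to the persistent record so that the next run skips them.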
Installation Requirements
Please ensure your system has the following Python packages installed:
- pandas
- requests
- beautifulsoup4
- selenium
You can install these packages using the following command:
pip install pandas requests beautifulsoup4 selenium
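If you are unsure whether the packages are already present, a short check like the one below can help. It is a sketch that is not part of the program; the `missing_packages` helper is hypothetical. Note that the beautifulsoup4 package is imported under the module name `bs4`.

```python
import importlib.util

# pip package name -> module name used in `import` statements.
REQUIRED = {
    "pandas": "pandas",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "selenium": "selenium",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages. Install them with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```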
User Instructions
The following instructions are intended for Windows users:
- Clone or download the program to your computer.
- Open Command Prompt (cmd) or PowerShell:
  - Press the Win key, then type “cmd” or “PowerShell” in the search box and select the appropriate program to open.
- Navigate to the directory containing your Python program:
  - Use the cd command to change to the directory where your main.py file is located. For example, if your main.py file is in C:\Users\Username\Documents\Project, you can enter: cd C:\Users\Username\Documents\Project
Execute the program:
- Enter the following command in the command line, ensuring to replace the parameters in brackets with your specified settings:
python main.py --queries [keywords] --start_year_month [start year-month] --end_year_month [end year-month] --num_papers [number of papers]
- For example, to download papers on “fairness”, “machine learning”, and “synthetic data generation”, and to set the search time period between January 2023 and December 2023, you can use the following command:
python main.py --queries "fairness" "machine learning" "synthetic data generation" --start_year_month 202301 --end_year_month 202312 --num_papers 5
- Enter the following command in the command line, ensuring to replace the parameters in brackets with your specified settings:
Please ensure your system has Python and the necessary packages installed, as outlined in the Installation Requirements section.
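The year-month arguments in the example command use a six-digit form (202301 for January 2023). The sketch below shows one way such values could be checked before a search starts. The `parse_year_month` and `validate_range` helpers are hypothetical, and whether the program performs this validation itself is not stated here.

```python
def parse_year_month(value):
    """Convert a 'YYYYMM' string such as '202301' into a (year, month) tuple."""
    if len(value) != 6 or not value.isdigit():
        raise ValueError(f"expected six digits in YYYYMM form, got {value!r}")
    year, month = int(value[:4]), int(value[4:])
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 01 and 12, got {value[4:]!r}")
    return year, month


def validate_range(start, end):
    """Parse both ends of the range and check that start is not after end."""
    start_ym = parse_year_month(start)
    end_ym = parse_year_month(end)
    if start_ym > end_ym:  # tuples compare by year first, then by month
        raise ValueError(f"start {start} is later than end {end}")
    return start_ym, end_ym


start_ym, end_ym = validate_range("202301", "202312")
```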
Setting Default Parameters
This program supports setting default parameter values, allowing users to automatically use predefined search settings when no command line parameters are specified. This feature is convenient for users who frequently use the same parameters.
Default Value Settings
Below is the implementation of the Config class, which the program uses to store all default parameters:
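class Config:
    # Keywords searched when --queries is not given.
    DEFAULT_QUERIES = ["fairness", "machine learning", "synthetic data generation"]
    # Year-month values are six-digit strings, e.g. "202301" for January 2023.
    DEFAULT_START_YEAR_MONTH = "202301"
    DEFAULT_END_YEAR_MONTH = "202312"
    DEFAULT_NUM_PAPERS = 5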
Using Default Values
When you omit a parameter on the command line, the program uses the corresponding default value defined in the Config class. If you specify only some parameters, the remaining ones fall back to their defaults.
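The fallback behaviour described above could be wired up with argparse roughly as follows. This is a sketch under assumptions, not the program's actual main.py: the `parse_args` helper is hypothetical, and only the flag names and the Config values are taken from this README.

```python
import argparse


class Config:
    DEFAULT_QUERIES = ["fairness", "machine learning", "synthetic data generation"]
    DEFAULT_START_YEAR_MONTH = "202301"
    DEFAULT_END_YEAR_MONTH = "202312"
    DEFAULT_NUM_PAPERS = 5


def parse_args(argv=None):
    """Parse command-line arguments, using Config values for omitted flags."""
    parser = argparse.ArgumentParser(description="Download papers from arXiv.")
    # nargs="+" accepts one or more keywords after --queries.
    parser.add_argument("--queries", nargs="+", default=Config.DEFAULT_QUERIES)
    parser.add_argument("--start_year_month", default=Config.DEFAULT_START_YEAR_MONTH)
    parser.add_argument("--end_year_month", default=Config.DEFAULT_END_YEAR_MONTH)
    parser.add_argument("--num_papers", type=int, default=Config.DEFAULT_NUM_PAPERS)
    return parser.parse_args(argv)


# Only --num_papers is given, so the other settings come from Config.
args = parse_args(["--num_papers", "10"])
```

In this sketch, `args.num_papers` is taken from the command line while `args.queries`, `args.start_year_month`, and `args.end_year_month` keep their Config defaults.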
Modifying Default Values
To change the defaults, edit the values directly in the Config class. This saves you from entering the same parameters each time you run the program.