arXiv Paper Downloader

This Python program automatically downloads papers from arXiv that match one or more keywords, fall within a given publication year-month range, and do not exceed the requested number of papers. Note that the script currently supports only the Chrome browser.

[Download here (GitHub)]


Feature Highlights

  • Keyword Search: Searches for papers using keywords provided by the user; multiple keywords are supported.
  • Upload Date Range: Users can restrict the search to papers uploaded within a specified year-month range.
  • Automatic Download: The program automatically downloads the papers found by the search to a designated local folder.
  • Avoid Duplicate Downloads: Records information about downloaded papers in an Excel file for future reference and to prevent duplicate downloads.
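
The duplicate check described in the last bullet can be sketched as follows. This is a minimal illustration, not the program's actual code: the function name filter_new_papers, the arxiv_id field, and the sample IDs are all hypothetical, and the record is held in a plain Python list here, whereas the program keeps it in an Excel file.

```python
def filter_new_papers(candidates, downloaded_ids):
    """Return the candidates whose ID is not in the download record."""
    seen = set(downloaded_ids)
    new_papers = []
    for paper in candidates:
        if paper["arxiv_id"] in seen:
            continue  # already recorded, so skip the download
        seen.add(paper["arxiv_id"])  # also drops repeats within one batch
        new_papers.append(paper)
    return new_papers


# Hypothetical record of earlier downloads and a fresh batch of search results.
already_downloaded = ["2301.00001", "2301.00002"]
search_results = [
    {"arxiv_id": "2301.00002", "title": "Already on disk"},
    {"arxiv_id": "2302.00003", "title": "New paper"},
    {"arxiv_id": "2302.00003", "title": "New paper"},
]

to_download = filter_new_papers(search_results, already_downloaded)
print([paper["arxiv_id"] for paper in to_download])  # ['2302.00003']
```

Keying the record on a stable identifier rather than on the title keeps the check reliable when two papers share a title or a title is later revised.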

Installation Requirements

Please ensure your system has the following Python packages installed:

  • pandas
  • requests
  • beautifulsoup4
  • selenium

You can install these packages using the following command:

pip install pandas requests beautifulsoup4 selenium

User Instructions

The following instructions are intended for Windows users:

  1. Clone or download the program to your computer

  2. Open Command Prompt (cmd) or PowerShell:

    • Press the Win key, then type “cmd” or “PowerShell” in the search box and select the appropriate program to open.
  3. Navigate to the directory containing your Python program:

    • Use the cd command to change to the directory where your main.py file is located. For example, if your main.py file is in C:\Users\Username\Documents\Project, you can enter:
      cd C:\Users\Username\Documents\Project
      
  4. Execute the program:

    • Enter the following command in the command line, replacing the bracketed placeholders with your own settings:
      python main.py --queries [keywords] --start_year_month [start year-month] --end_year_month [end year-month] --num_papers [number of papers]
      
    • For example, to search for papers on “fairness”, “machine learning”, and “synthetic data generation” from January 2023 through December 2023 (dates are written as YYYYMM), with the number of papers set to 5, use the following command:
      python main.py --queries "fairness" "machine learning" "synthetic data generation" --start_year_month 202301 --end_year_month 202312 --num_papers 5
      

Please ensure your system has Python and the necessary packages installed, as outlined in the Installation Requirements section.
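
The --start_year_month and --end_year_month values in the example command are six-digit YYYYMM strings. As a rough sketch of how such values could be validated and turned into an inclusive date range, consider the code below. The helper names parse_year_month and year_month_range are hypothetical, and the program itself may handle these values differently.

```python
import calendar
from datetime import date


def parse_year_month(value):
    """Split a YYYYMM string into (year, month), raising ValueError if malformed."""
    if len(value) != 6 or not value.isdigit():
        raise ValueError(f"expected YYYYMM, got {value!r}")
    year, month = int(value[:4]), int(value[4:])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {value!r}")
    return year, month


def year_month_range(start_ym, end_ym):
    """Return (first day of the start month, last day of the end month)."""
    start_year, start_month = parse_year_month(start_ym)
    end_year, end_month = parse_year_month(end_ym)
    first_day = date(start_year, start_month, 1)
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = date(end_year, end_month, calendar.monthrange(end_year, end_month)[1])
    if first_day > last_day:
        raise ValueError("start year-month must not be later than end year-month")
    return first_day, last_day


print(year_month_range("202301", "202312"))
```

Rejecting malformed input up front means a typo such as 20231 fails with a clear message instead of silently producing an empty search.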

Setting Default Parameters

The program supports default parameter values, so predefined search settings are used when no command-line parameters are given. This is convenient for users who frequently run the same search.

Default Value Settings

Below is the implementation of the Config class, which is used in the program to store all default parameters:

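class Config:
    # Keywords to search for
    DEFAULT_QUERIES = ["fairness", "machine learning", "synthetic data generation"]
    # Start and end of the search period, in YYYYMM format
    DEFAULT_START_YEAR_MONTH = "202301"
    DEFAULT_END_YEAR_MONTH = "202312"
    # Number of papers to download
    DEFAULT_NUM_PAPERS = 5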

Using Default Values

If you omit a parameter on the command line, the program uses the corresponding default value from the Config class. If you specify only some parameters, the rest fall back to their defaults.
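
One common way to get this fallback behavior is to pass the Config values as argparse defaults. The sketch below assumes the program parses its options with argparse, which the text here does not confirm. The function name build_parser is hypothetical, and the Config class is repeated so the example runs on its own.

```python
import argparse


class Config:
    DEFAULT_QUERIES = ["fairness", "machine learning", "synthetic data generation"]
    DEFAULT_START_YEAR_MONTH = "202301"
    DEFAULT_END_YEAR_MONTH = "202312"
    DEFAULT_NUM_PAPERS = 5


def build_parser():
    """Build a parser whose options fall back to the Config defaults."""
    parser = argparse.ArgumentParser(description="Download papers from arXiv.")
    # nargs="+" collects one or more quoted keywords into a list.
    parser.add_argument("--queries", nargs="+", default=Config.DEFAULT_QUERIES)
    parser.add_argument("--start_year_month", default=Config.DEFAULT_START_YEAR_MONTH)
    parser.add_argument("--end_year_month", default=Config.DEFAULT_END_YEAR_MONTH)
    parser.add_argument("--num_papers", type=int, default=Config.DEFAULT_NUM_PAPERS)
    return parser


# Only --num_papers is given, so every other option keeps its default.
args = build_parser().parse_args(["--num_papers", "10"])
print(args.queries, args.start_year_month, args.end_year_month, args.num_papers)
```

Because each option carries its own default, any subset of options can be given and the remaining ones are filled in independently.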

Modifying Default Values

To change the defaults, edit the values directly in the Config class. You can then run the program without entering every parameter each time.