Stepstone Scraper

stepstone_scraper.get_latest_output_file(directory, job_title)

Retrieve the most recent JSON output file for a given job title in the specified directory.

This function lists all JSON files in the directory that start with the job title and selects the most recently created one.

Parameters:
  • directory – The directory to search for JSON files.

  • job_title – The job title used to filter the files.

Returns:

The path to the most recent JSON file, or None if no files are found.
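The lookup described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: it assumes the output files are named with the job title as a prefix and a .json extension, and it uses os.path.getctime as the notion of "most recently created" (on Unix this is the last metadata change time, not the true creation time).

```python
import glob
import os


def get_latest_output_file(directory, job_title):
    """Return the newest '<job_title>*.json' path in directory, or None."""
    # Escape both parts so characters such as '[' are matched literally.
    pattern = os.path.join(glob.escape(directory), glob.escape(job_title) + "*.json")
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    # Assumption: getctime stands in for "creation time".
    return max(candidates, key=os.path.getctime)
```

A caller would typically check for None before opening the returned path.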

stepstone_scraper.run_spiders(job_title, db)

Run Scrapy spiders to scrape job listings from Stepstone and save the data to MongoDB.

This function sets up and runs two Scrapy spiders: one to collect job links and another to scrape detailed job information. It manages the output files and ensures the scraped data is saved to the appropriate MongoDB collection.

Parameters:
  • job_title – The job title to search for on Stepstone.

  • db – A MongoDB database instance to store the scraped data.

Note: project_path is set to ‘/app/stepstonesearch’ for the Docker configuration. For local execution, change it to the local location of the project (localpath/stepstonesearch).
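One way to avoid editing the path by hand when switching between Docker and local runs is to read it from an environment variable and fall back to the Docker value. The sketch below is only a suggestion: the variable name STEPSTONE_PROJECT_PATH and the helper resolve_project_path are hypothetical and are not part of the module.

```python
import os

# Default taken from the Docker configuration described above.
DOCKER_PROJECT_PATH = "/app/stepstonesearch"


def resolve_project_path(env=None):
    """Return the Scrapy project path, preferring an environment override."""
    # 'env' is injectable for testing; it defaults to the process environment.
    env = os.environ if env is None else env
    # STEPSTONE_PROJECT_PATH is a hypothetical variable name.
    return env.get("STEPSTONE_PROJECT_PATH", DOCKER_PROJECT_PATH)
```

With this approach a local run would export STEPSTONE_PROJECT_PATH before starting the scraper, while the Docker image needs no extra configuration.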

stepstone_scraper.save_to_mongo(json_file, job_title, db)

Save the scraped job data from a JSON file to a MongoDB collection.

This function reads the JSON file produced by the Scrapy spider, processes the data, and inserts or updates the records in the specified MongoDB collection. It handles both single job entries and lists of jobs.

Parameters:
  • json_file – The path to the JSON file containing the scraped job data.

  • job_title – The job title used to determine the MongoDB collection name.

  • db – A MongoDB database instance to store the data.
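The insert-or-update behaviour described above might look roughly like the sketch below. It is written under several assumptions that the documentation does not confirm: the collection is named exactly after job_title, each job is identified by a "url" field (the key_field parameter is hypothetical), and the function returns the number of processed jobs. The db argument is expected to behave like a pymongo Database, that is, db[name] yields a collection that offers update_one and insert_one.

```python
import json


def save_to_mongo(json_file, job_title, db, key_field="url"):
    """Upsert the jobs stored in json_file into the collection db[job_title]."""
    with open(json_file, encoding="utf-8") as fh:
        data = json.load(fh)

    # Handle both a single job (dict) and a list of jobs.
    jobs = data if isinstance(data, list) else [data]

    collection = db[job_title]  # assumption: collection named after the job title
    for job in jobs:
        if key_field in job:
            # Update the existing record or insert it if it is missing.
            collection.update_one({key_field: job[key_field]}, {"$set": job}, upsert=True)
        else:
            # No identifying key available, so fall back to a plain insert.
            collection.insert_one(job)
    return len(jobs)
```

Using an upsert keyed on a stable field keeps repeated scraping runs from creating duplicate documents for the same listing.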

stepstone_scraper.wait_for_file(json_file, timeout=30)

Wait for a specified file to be created within a given timeout period.

This function is used to wait for the output file generated by a Scrapy spider. It checks periodically if the file exists and returns once the file is found or the timeout is reached.

Parameters:
  • json_file – The path to the file to wait for.

  • timeout – The maximum time to wait in seconds (default is 30 seconds).

Returns:

True if the file is found within the timeout, False otherwise.
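The periodic check described above can be sketched as a simple polling loop. This is an illustration rather than the module's actual code: the poll_interval parameter is hypothetical and was added here to make the check frequency explicit.

```python
import os
import time


def wait_for_file(json_file, timeout=30, poll_interval=0.5):
    """Return True once json_file exists, or False after timeout seconds."""
    # A monotonic clock is unaffected by system clock adjustments.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(json_file):
            return True
        time.sleep(poll_interval)
    # One last check in case the file appeared during the final sleep.
    return os.path.exists(json_file)
```

Note that the file existing does not guarantee that the spider has finished writing it, so a caller may still need to handle incomplete JSON.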