Scrapy Spiders
- class stepstonesearch.spiders.Links.LinksSpider(*args: Any, **kwargs: Any)
A Scrapy spider to scrape job listing links from Stepstone.
This spider starts from the search results page for a given job title, extracts job listing data (e.g., title, company, location, link), and follows pagination up to a specified maximum number of pages or jobs. The results are saved to a JSON file.
- Variables:
name – The name of the spider.
allowed_domains – Domains allowed for the spider to crawl.
custom_settings – Custom settings for the spider, including the output feed configuration (illustrated in the sketch after this list).
base_url – The base URL template for Stepstone job search pages.
start_urls – The initial URL(s) to start scraping from.
jobs_collected – A counter for the number of jobs collected so far.
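By way of illustration, a minimal class skeleton consistent with the attributes above might look like the following. The concrete values (the domain, the feed path, the URL template, and the job_title argument) are assumptions for the sketch, not taken from the project:

```python
import scrapy


class LinksSpider(scrapy.Spider):
    """Sketch of the attribute layout described above; values are illustrative."""

    name = "Links"
    allowed_domains = ["stepstone.de"]  # assumed domain

    # Scrapy's FEEDS setting declares the per-spider output feed,
    # matching the "output feed configuration" mentioned above.
    custom_settings = {
        "FEEDS": {
            "links.json": {"format": "json", "overwrite": True},  # assumed path
        },
    }

    # Assumed URL template; {title} and {page} are filled in at request time.
    base_url = "https://www.stepstone.de/jobs/{title}?page={page}"

    def __init__(self, job_title="python-developer", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.job_title = job_title
        self.start_urls = [self.base_url.format(title=job_title, page=1)]
        self.jobs_collected = 0  # documented counter
```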
- extract_items(data)
Extract the ‘items’ array from the JSON-like data in the page source.
This method uses a stack-based scan to locate the matching brackets that delimit the ‘items’ array in the page’s embedded script data.
- Parameters:
data – The raw HTML content of the page.
- Returns:
The extracted ‘items’ array as a string, or None if not found.
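To make the balanced-bracket idea concrete, here is one way such a scan could be written. The anchor string `'"items":'` is an assumption about how the key appears in the script data:

```python
def extract_items(data):
    """Return the balanced '[...]' that follows the first '"items":' key, or None."""
    anchor = data.find('"items":')        # assumed key spelling in the script data
    if anchor == -1:
        return None
    start = data.find("[", anchor)
    if start == -1:
        return None

    stack = []
    for i in range(start, len(data)):
        if data[i] == "[":
            stack.append("[")
        elif data[i] == "]":
            stack.pop()
            if not stack:                 # brackets balanced: array is complete
                return data[start:i + 1]
    return None                           # unbalanced input: nothing to return
```

This sketch ignores bracket characters that may appear inside string literals; a robust version would also track whether the scan is inside a quoted string before counting brackets.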
- parse(response)
Parse the search results page and extract job listing data.
This method extracts job items from the page, yields them as dictionaries, and follows pagination links until the maximum number of pages or jobs is reached.
- Parameters:
response – The Scrapy response object containing the search results page HTML.
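A sketch of that parse flow is shown below, building on the class skeleton sketched earlier. The JSON field names, the page meta key, and the MAX_PAGES/MAX_JOBS limits are assumptions:

```python
import json

import scrapy

MAX_PAGES = 5     # assumed limits; the real spider may read these from settings
MAX_JOBS = 100


def parse(self, response):
    """Yield one dict per job item, then request the next results page."""
    raw = self.extract_items(response.text)   # documented helper, see above
    if raw is None:
        return
    for entry in json.loads(raw):
        if self.jobs_collected >= MAX_JOBS:
            return
        self.jobs_collected += 1
        yield {
            "title": entry.get("title"),                   # assumed field names
            "company": entry.get("companyName"),
            "location": entry.get("location"),
            "link": response.urljoin(entry.get("url", "")),
        }

    page = response.meta.get("page", 1)
    if page < MAX_PAGES:
        yield scrapy.Request(
            self.base_url.format(title=self.job_title, page=page + 1),
            meta={"page": page + 1},
            callback=self.parse,
        )
```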
- class stepstonesearch.spiders.sitespider.sitespiderSpider(*args: Any, **kwargs: Any)
A Scrapy spider to scrape detailed job information from Stepstone.
This spider reads job links from a provided JSON file, visits each job page, and extracts details such as the job title, company name, location, salary, description paragraphs, and lists (e.g., benefits). The scraped records accumulate in the job_details list and are written to a JSON file when the spider closes.
- Variables:
name – The name of the spider.
allowed_domains – Domains allowed for the spider to crawl.
input_file – Path to the JSON file containing job links (see the constructor sketch after this list).
output_file – Path to the output JSON file where scraped data will be saved.
items – List of job items loaded from the input JSON file.
job_details – List to store scraped job details.
job_title – The job title used for naming the output file.
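For illustration, the constructor might wire these attributes together roughly as follows; the default paths and the output-file naming scheme are assumptions:

```python
import scrapy


class sitespiderSpider(scrapy.Spider):
    name = "sitespider"
    allowed_domains = ["stepstone.de"]                  # assumed domain

    def __init__(self, input_file="links.json", job_title="python-developer",
                 *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.input_file = input_file                    # JSON file of job links
        self.job_title = job_title
        # Output file named after the searched job title, per the docs.
        self.output_file = f"{job_title}_details.json"  # assumed naming scheme
        self.items = self.load_items()                  # documented helper, below
        self.job_details = []                           # accumulates scraped jobs
```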
- closed(reason)
Save the collected job details to a JSON file when the spider is closed.
- Parameters:
reason – The reason for the spider being closed.
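A minimal version of this hook, assuming UTF-8 JSON output to the output_file attribute above:

```python
import json


def closed(self, reason):
    """Called once by Scrapy when the crawl finishes; persist the results."""
    with open(self.output_file, "w", encoding="utf-8") as f:
        json.dump(self.job_details, f, ensure_ascii=False, indent=2)
    self.logger.info("Saved %d jobs (close reason: %s)",
                     len(self.job_details), reason)
```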
- extract_job_id(url)
Extract the job ID from the job page URL.
The job ID is extracted using a regular expression that looks for a sequence of digits before “-inline.html”.
- Parameters:
url – The URL of the job page.
- Returns:
The extracted job ID or None if not found.
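Following that description, the extraction could be written as below; the exact pattern is an assumption beyond “digits before -inline.html”:

```python
import re


def extract_job_id(self, url):
    """Return the digit run immediately preceding '-inline.html', or None."""
    match = re.search(r"(\d+)-inline\.html", url)
    return match.group(1) if match else None

# e.g. a URL ending in "...-job-12345-inline.html" would yield "12345".
```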
- load_items()
Load job items from the input JSON file.
- Returns:
List of job items or an empty list if loading fails.
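A defensive loader consistent with that contract, assuming the input file holds a JSON array:

```python
import json


def load_items(self):
    """Read job items from self.input_file, falling back to [] on any error."""
    try:
        with open(self.input_file, encoding="utf-8") as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        self.logger.error("Could not load %s: %s", self.input_file, exc)
        return []
```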
- parse(response)
Parse the job page and extract relevant job details.
This method extracts paragraphs, lists (e.g., benefits), and other job details from the page. It cleans the text and organizes the data into a dictionary, which is then appended to the job_details list.
- Parameters:
response – The Scrapy response object containing the job page HTML.
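One way such a parse method could look is sketched below. Every CSS selector and output field name here is an assumption about Stepstone's markup, not taken from the source:

```python
def parse(self, response):
    """Collect cleaned text from the job page and append it to job_details."""
    item = response.meta["item"]          # metadata attached by start_requests

    def clean(fragments):
        # Strip whitespace and drop empty fragments.
        return [t.strip() for t in fragments if t.strip()]

    self.job_details.append({
        "job_id": self.extract_job_id(response.url),
        "title": response.css("h1::text").get(default="").strip(),      # assumed
        "company": item.get("company"),
        "location": item.get("location"),
        "paragraphs": clean(response.css("article p::text").getall()),  # assumed
        "lists": clean(response.css("article li::text").getall()),      # assumed
    })
```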
- start_requests()
Generate Scrapy requests for each job link in the input items.
Each request includes the job item metadata and is sent to the parse method for processing.
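Finally, the request generation might look like this, assuming each loaded item carries its URL under a ‘link’ key:

```python
import scrapy


def start_requests(self):
    """Yield one request per job link, carrying the item along in meta."""
    for item in self.items:
        url = item.get("link")            # assumed key from the links spider
        if url:
            yield scrapy.Request(url, callback=self.parse, meta={"item": item})
```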