When optimizing your web scraping projects, conserving bandwidth is key. The tips and guidelines below show how to apply bandwidth-saving techniques in your script and keep your scraping efficient and resource-friendly.
Downloading unnecessary media (images, videos) is a common bandwidth drain. You can block these resources directly within your script.
Blocking resources can occasionally interfere with page loading, because some sites expect these resources to be fetched as part of their anti-bot checks. If you see issues after blocking resources, revert your blocking logic before contacting support.
const page = await browser.newPage();

// Enable request interception
await page.setRequestInterception(true);

// Listen for requests
page.on('request', (request) => {
  if (request.resourceType() === 'image') {
    // If the request is for an image, block it
    request.abort();
  } else {
    // If it's not an image request, allow it to continue
    request.continue();
  }
});
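The same interception hook can be extended to other heavyweight resource types. The following is a minimal variation, assuming you also want to skip video/audio, fonts, and stylesheets; trim the set to whatever your target site tolerates (see the note above about blocking and page loading).

const page = await browser.newPage();
await page.setRequestInterception(true);

// Resource types to skip; 'media' covers audio and video. Adjust the set to your needs.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

page.on('request', (request) => {
  if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
    // Abort before the response is downloaded, so no bandwidth is spent on it
    request.abort();
  } else {
    request.continue();
  }
});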
Blocking media-type requests alone may not always reduce your bandwidth usage. Some websites have ad slots that continuously refresh their ads, and others use live-bidding mechanisms that keep requesting new ads whenever one fails to load. In such cases, it is important to identify and block these specific network requests. Doing so decreases the number of network requests and, consequently, lowers your bandwidth usage.
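One way to handle this is to filter intercepted requests by URL in addition to resource type. Below is a minimal sketch; the ad-network hostnames are placeholders, so replace them with the domains you actually see refreshing in the page's network traffic.

const page = await browser.newPage();
await page.setRequestInterception(true);

// Placeholder hostnames - replace with the ad/bidding domains observed on your target site
const BLOCKED_URL_PATTERNS = ['adservice.example-ads.com', 'bids.example-exchange.com'];

page.on('request', (request) => {
  const url = request.url();
  if (BLOCKED_URL_PATTERNS.some((pattern) => url.includes(pattern))) {
    // Drop refreshing ad and bid requests before they consume bandwidth
    request.abort();
  } else {
    request.continue();
  }
});

To find the offending requests, watch the page's network traffic (for example in your browser's DevTools) and note which hosts keep firing after the initial page load.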
One common inefficiency in scraping jobs is repeatedly downloading the same page during a single session. Leveraging cached pages - a version of a previously scraped page - can significantly increase your scraping efficiency, because it avoids repeated network requests to the same domain. Not only does this save bandwidth by eliminating redundant fetches, it also makes interactions with the preloaded content faster and more responsive. A minimal in-memory caching sketch follows the full example below.
The selectors used in this example (.product-name, .product-price, .product-link, .apply-coupon-button) are generic placeholders. Please update these to match the actual HTML structure of the website you are scraping.
Also, make sure to replace https://example.com with your target URL.
Puppeteer
const puppeteer = require('puppeteer-core');

const AUTH = 'USER:PASS';
const SBR_WS_ENDPOINT = `wss://${AUTH}@brd.superproxy.io:9222`;

async function scrapeProductDetails(link) {
  console.log('Connecting to Scraping Browser...');
  const browser = await puppeteer.connect({
    browserWSEndpoint: SBR_WS_ENDPOINT,
  });
  try {
    console.log(`Connected! Navigating to: ${link}`);
    // Open a new tab for this product page
    const page = await browser.newPage();
    await page.goto(link, { timeout: 2 * 60 * 1000 });

    // Wait for and extract product name
    await page.waitForSelector('.product-name', { timeout: 30000 });
    const productName = await page.$eval('.product-name', el => el.textContent.trim());

    // Try to apply coupon if button exists
    const couponButton = await page.$('.apply-coupon-button');
    if (couponButton) {
      await couponButton.click();
    }

    // Extract price
    await page.waitForSelector('.product-price', { timeout: 30000 });
    const productPrice = await page.$eval('.product-price', el => el.textContent.trim());

    return { productName, productPrice, link };
  } catch (error) {
    console.error(`Error scraping ${link}:`, error.message);
    return null;
  } finally {
    await browser.close();
  }
}

async function main() {
  console.log('Connecting to Scraping Browser...');
  const browser = await puppeteer.connect({
    browserWSEndpoint: SBR_WS_ENDPOINT,
  });
  try {
    console.log('Connected! Navigating to listing page...');
    const page = await browser.newPage();
    await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });
    await page.waitForSelector('.product-link', { timeout: 30000 });

    // Extract product links from the listing page
    const productLinks = await page.$$eval('.product-link', links =>
      links.map(link => link.href).slice(0, 10) // Limit to first 10 for testing
    );
    console.log(`Found ${productLinks.length} products`);
    await browser.close();

    // Scrape product details in parallel
    const productDetailsPromises = productLinks.map(link => scrapeProductDetails(link));
    const productDetails = await Promise.all(productDetailsPromises);

    // Filter out any null results from failed scrapes
    const validProductDetails = productDetails.filter(details => details !== null);
    console.log('Scraped product details:', validProductDetails);
  } catch (error) {
    console.error('Error during the main process:', error);
  }
}

main();
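To illustrate the caching idea described earlier, here is a minimal sketch that wraps the scrapeProductDetails function from the example above with a simple in-memory cache keyed by URL. The cache shape and the scrapeWithCache name are assumptions for illustration, not part of any library API.

// Minimal in-session cache: URL -> previously scraped result
const pageCache = new Map();

async function scrapeWithCache(link) {
  // Serve repeated URLs from the cache instead of re-fetching them
  if (pageCache.has(link)) {
    console.log(`Cache hit, skipping network request for: ${link}`);
    return pageCache.get(link);
  }
  const details = await scrapeProductDetails(link); // from the example above
  if (details !== null) {
    pageCache.set(link, details); // only cache successful scrapes
  }
  return details;
}

// Usage in main(): swap scrapeProductDetails for scrapeWithCache
// const productDetails = await Promise.all(productLinks.map(scrapeWithCache));

Note that with fully parallel Promise.all calls, duplicate links that start at the same moment will still each trigger a fetch; for strict deduplication you can cache the in-flight promise instead of the resolved result.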