May 7, 2025
Change can mean opportunity. A key component of seizing opportunity is having better information than your competitors.
Web scraping is not new, but LLMs provide another tool for automated product matching when SKU matching isn’t enough.
The biggest gotcha with web scraping is having your traffic flagged as suspicious by the scraped site. There are many ways to manage this, as discussed in the concluding sections.
Your industry may be shifting quickly enough that your relative position is changing without you realizing it. We want to know about such shifts as early as possible, while they’re still in their infancy.
Information we’re typically looking to keep track of: pricing, sourcing, and the makeup of the offering catalog itself.
A strategically formulated cart can be put together on any shopping site, and changes to those cart items can be tracked far more easily than a full catalog scrape (similar to economists’ tracking of staple grocery baskets).
Initially this can all be done manually — an actual person builds the cart and puts the prices into a spreadsheet. Then automate regeneration of the cart when the cart is cleared (many sites do this after a certain number of days). Then automate scraping of the prices of the cart items. Then automate tracking of other information about the cart items (ex: manufacturer or source country).
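As a minimal sketch of the price-scraping step, assuming the cart page is plain server-rendered HTML — the URL, CSS selectors, and CSV layout below are all hypothetical stand-ins for whatever the real site uses:

```python
# Minimal sketch: scrape prices for a saved "representative cart" and append
# them to a CSV price history. Assumptions (hypothetical): the cart page is
# server-rendered HTML at CART_URL, and each line item exposes a name and
# price via the CSS selectors below. Real sites will differ.
import csv
from datetime import date

import requests
from bs4 import BeautifulSoup

CART_URL = "https://example-competitor.com/cart"  # hypothetical

def scrape_cart_prices(url: str) -> list[dict]:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    items = []
    for row in soup.select(".cart-item"):  # hypothetical selector
        name = row.select_one(".item-name").get_text(strip=True)
        price_text = row.select_one(".item-price").get_text(strip=True)
        price = float(price_text.replace("$", "").replace(",", ""))
        items.append({"date": date.today().isoformat(), "name": name, "price": price})
    return items

def append_to_spreadsheet(items: list[dict], path: str = "cart_history.csv") -> None:
    # Append each observation so the CSV accumulates a price history over time.
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "name", "price"])
        if f.tell() == 0:  # new/empty file: write the header once
            writer.writeheader()
        writer.writerows(items)

if __name__ == "__main__":
    append_to_spreadsheet(scrape_cart_prices(CART_URL))
```

Run on a schedule (cron or similar), this replaces the manual spreadsheet step while keeping its simplicity.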
If you have an ecommerce site that uses manufacturer model numbers shared with other sites, then your next step in building out automated competitor intelligence gathering is direct comparisons of identical SKUs. This is simple to build, almost mechanical.
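A sketch of that direct SKU comparison, assuming both catalogs have been exported to CSVs with a shared manufacturer model number column — the file and column names are hypothetical:

```python
# Sketch: direct comparison of identical SKUs across two catalogs.
# Assumes hypothetical CSV exports sharing a "model_number" column
# and each having a "price" column; adjust names to your actual data.
import pandas as pd

ours = pd.read_csv("our_catalog.csv")            # hypothetical file
theirs = pd.read_csv("competitor_catalog.csv")   # hypothetical file

# Inner join on the manufacturer model number: only SKUs both sites carry.
matched = ours.merge(theirs, on="model_number", suffixes=("_ours", "_theirs"))
matched["price_gap"] = matched["price_ours"] - matched["price_theirs"]

# Flag items where we are more than 5% above the competitor.
flagged = matched[matched["price_gap"] > 0.05 * matched["price_theirs"]]
print(flagged.sort_values("price_gap", ascending=False).head(20))
```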
After that, another quick payoff relative to upfront cost usually comes from characterizing competitors’ catalogs and detecting changes in them, without trying to match items against your own store or additional competitor sites. For example, it may be helpful to know that the number of SKUs your competitor offers in the ball bearings category has stayed roughly the same, but that SKU deletions/additions within the category have lately been much higher than usual. You can also track pricing changes over time, and spot that the sourcing mix has shifted from 80% China to 55% China.
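One way to surface this kind of churn is to diff two dated catalog snapshots per category. The sketch below assumes snapshot CSVs with hypothetical sku, category, and country columns:

```python
# Sketch: compare two dated snapshots of a competitor catalog to surface
# SKU churn and sourcing-mix shifts per category. Column names (sku,
# category, country) are hypothetical stand-ins for your actual schema.
import pandas as pd

old = pd.read_csv("competitor_2025-04-01.csv")  # hypothetical snapshot
new = pd.read_csv("competitor_2025-05-01.csv")  # hypothetical snapshot

for cat, new_group in new.groupby("category"):
    old_group = old[old["category"] == cat]
    old_skus, new_skus = set(old_group["sku"]), set(new_group["sku"])
    added, deleted = new_skus - old_skus, old_skus - new_skus

    # Churn can be high even when the total SKU count looks flat.
    print(f"{cat}: {len(old_skus)} -> {len(new_skus)} SKUs, "
          f"+{len(added)} added, -{len(deleted)} deleted")

    # Sourcing mix: share of SKUs per source country, then vs. now.
    old_mix = old_group["country"].value_counts(normalize=True)
    new_mix = new_group["country"].value_counts(normalize=True)
    for country in sorted(set(old_mix.index) | set(new_mix.index)):
        print(f"  {country}: {old_mix.get(country, 0):.0%} -> {new_mix.get(country, 0):.0%}")
```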
Thoughtfully generated reports — both at a line item and aggregate level — can be quickly skimmed by a knowledgeable human for notable changes, and further manual investigation can be done in a highly targeted manner. This should yield a credible core understanding of sourcing changes, pricing changes, and offering changes.
As one can imagine, cross-catalog comparison can be easier or tougher depending on the state of the two catalogs. More structure obviously makes things much easier. Use language similarity tools to match both top-level categories and descriptors.
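A sketch of one such similarity tool, using sentence embeddings to match each competitor category to the closest category in your own taxonomy — the category lists here are made up, and the same approach works for item descriptors:

```python
# Sketch: match competitor category names to our own taxonomy using
# sentence embeddings (sentence-transformers). Category lists are
# hypothetical examples.
from sentence_transformers import SentenceTransformer, util

our_categories = ["Ball Bearings", "Roller Bearings", "Shaft Collars"]
their_categories = ["Bearings - Ball", "Bearings - Tapered Roller", "Collars & Couplings"]

model = SentenceTransformer("all-MiniLM-L6-v2")
our_emb = model.encode(our_categories, convert_to_tensor=True)
their_emb = model.encode(their_categories, convert_to_tensor=True)

# Cosine similarity matrix: rows = their categories, cols = ours.
scores = util.cos_sim(their_emb, our_emb)
for i, theirs in enumerate(their_categories):
    j = scores[i].argmax().item()
    print(f"{theirs!r} -> {our_categories[j]!r} (score {scores[i][j].item():.2f})")
```

Low-scoring matches are exactly the ones to route to manual review.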
Within your own catalog, customer search terms and browsing journeys can be used to expand the descriptors of an item. In most cases it’s best to use “hacky” methods first, manually review, and see how far you have left to go.
LLMs, grammar parsers, and some nifty NLP tricks can often help you pull out a reasonable taxonomy for an item if you only have blobs of text to work from. Search history before purchasing is also massively helpful. Relying on brute force tools alone is unlikely to finish the job, though — there’s always room for context-specific human cleverness.
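As a sketch of the LLM route, here is one way to pull a rough category path and attributes out of a raw description blob. The model choice, prompt, and output schema are illustrative assumptions, not a fixed recipe:

```python
# Sketch: use an LLM to extract a rough taxonomy from a free-text product
# description. Uses the OpenAI Python client; the model name, prompt, and
# JSON schema below are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_taxonomy(description: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[
            {"role": "system", "content": (
                "Extract a product taxonomy from the description. Respond with "
                'JSON: {"category_path": [...], "attributes": {...}}.'
            )},
            {"role": "user", "content": description},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

blob = "6203-2RS sealed deep groove ball bearing, 17mm bore, chrome steel"
print(extract_taxonomy(blob))
```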
Commerce sites beyond a basic level of sophistication will usually have monitors alerting on suspicious or bot traffic, especially if the site does not intend to sell via third-party listings (ex: Google Shopping). The approach to avoiding getting blocked is case-dependent, but a few themes recur:
- A fallback plan might be as simple as the “representative cart” method above.
- Couple an automation tool like Selenium with simple comprehension bots to monitor anything unusual happening at checkout, such as new fees or discounts (a sketch follows below).
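A sketch of that checkout-monitoring idea, driving a real browser with Selenium — the “comprehension” here is the simplest possible version, a diff of fee/discount line labels against the previous run, and the URL and selectors are hypothetical:

```python
# Sketch: drive a real browser through a cart page with Selenium and flag
# unfamiliar line items (new fees, discounts) at checkout. The URL and CSS
# selectors are hypothetical; the "comprehension" is a simple diff against
# the line labels seen on the previous run.
import json
from pathlib import Path

from selenium import webdriver
from selenium.webdriver.common.by import By

CART_URL = "https://example-competitor.com/cart"   # hypothetical
SEEN_PATH = Path("seen_checkout_lines.json")

driver = webdriver.Chrome()
try:
    driver.get(CART_URL)
    # Each summary row (subtotal, shipping, fees, discounts) - hypothetical selector.
    labels = [el.text.strip() for el in
              driver.find_elements(By.CSS_SELECTOR, ".order-summary .line-label")]
finally:
    driver.quit()

seen = set(json.loads(SEEN_PATH.read_text())) if SEEN_PATH.exists() else set()
new_lines = [label for label in labels if label not in seen]
if new_lines:
    print("Unfamiliar checkout lines:", new_lines)  # hook up alerting here
SEEN_PATH.write_text(json.dumps(sorted(seen | set(labels))))
```

Using a real browser also means your traffic executes JavaScript and loads assets like an ordinary shopper’s would, which helps with the flagging problem discussed above.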