Today we are going to build a web scraper for Amazon. The tool will be designed to collect basic information about products from a specific category. If you wish, you can expand the dataset to be collected on your own. Or, if you do not want to spend your time, you can hire our developers.

## Important points before starting development

Amazon renders goods depending on the geo factor, which is determined by the client's IP address. Therefore, if you are interested in information for the US market, you should use a proxy from the USA. On our Diggernaut platform, you can specify geo-targeting to a specific country using the geo option. However, it only works with paid subscription plans. With a free account, you can use your own proxy server; how to use it is described in our documentation at the link above. If you do not need targeting by country, you can omit any settings in the proxy section. In this case, mixed proxies from our pool will be used (if, of course, you run the web scraper in the cloud). To reduce the chance of blocking, we will also use pauses between requests.

There is one more thing we want to tell you about. Amazon can temporarily block an IP from which automated requests come: for example, it may show a captcha or an error page. Therefore, for the scraper to work successfully, we need to think about how it will catch and bypass these cases.

We will bypass the captcha with our internal captcha solution. Since this mechanism works as a microservice, it is available only when running the digger in the cloud, but it is free for all users of the Diggernaut platform. If you want to run the compiled digger on your own computer, you will need to use one of the integrated captcha-solving services: Anti-captcha or 2Captcha. You will also need your own account with one of these services. In addition, you will have to change the scraper code a little, namely, configure the parameters of the captcha_resolve command.

To bypass the access error, we will use proxy rotation and repeat mode for the walk command. This mode allows us to loop the page request until we say that everything went well. When rotating the proxy, the digger selects the next proxy from the list; if the list ends, the scraper returns to the first proxy. This function works both with proxies specified in the config by the user and with the proxies in our cloud, which all users of the Diggernaut platform have access to.

Since the category has a paginator and many catalog pages with products, we will use a pool. It allows us to describe the parsing logic only once, for the entire pool, and not for each page separately. Take into account that the maximum number of pages Amazon returns for one category (or search query) is 400. Therefore, if there are more than 8000 products in your category and you want to collect the maximum quantity, you need to revise the parameters of the search query, or you should collect products from subcategories.

Our web scraper will be able to extract product information for any search request, so you can configure all the query filters in the browser and use the URL from the browser address bar as the start page in the config. We will create a pool and put the start page in it. Then we go to the pool and load the next page from it. We also check whether Amazon has returned the access error; if so, we change the proxy and reload the page.
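The actual digger is written in Diggernaut's meta-language, but the retry-and-rotate idea described above is easy to picture in plain code. Below is a minimal Python sketch of the same logic: request a page through the current proxy, detect a blocked response, rotate to the next proxy (wrapping around to the first when the list ends), and pause between attempts. The proxy endpoints, blocked-page markers, and delay values are illustrative assumptions, not Diggernaut or Amazon specifics.

```python
import itertools
import random
import time

import requests

# Hypothetical proxy list; replace with your own endpoints.
PROXIES = [
    "http://user:pass@us-proxy-1.example.com:8000",
    "http://user:pass@us-proxy-2.example.com:8000",
]

# Cycle through the list, wrapping around to the first proxy when it ends,
# which mirrors the rotation behavior described above.
proxy_cycle = itertools.cycle(PROXIES)


def looks_blocked(html: str) -> bool:
    """Heuristic check for a captcha or access-error page (markers assumed)."""
    markers = ("Enter the characters you see below", "Sorry, something went wrong")
    return any(m in html for m in markers)


def fetch_with_rotation(url: str, max_attempts: int = 10) -> str:
    """Request a page, rotating the proxy and retrying while we are blocked."""
    for _ in range(max_attempts):
        proxy = next(proxy_cycle)
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                headers={"User-Agent": "Mozilla/5.0"},
                timeout=30,
            )
            if resp.ok and not looks_blocked(resp.text):
                return resp.text  # everything went well: leave the loop
        except requests.RequestException:
            pass  # network error: fall through and rotate to the next proxy
        # Pause between requests to reduce the chance of blocking.
        time.sleep(random.uniform(2.0, 6.0))
    raise RuntimeError(f"Still blocked after {max_attempts} attempts: {url}")
```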
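The pool of catalog pages can be pictured the same way: a queue seeded with the start page, where every fetched page pushes the next paginator link back in, capped at Amazon's 400-page limit. This sketch reuses fetch_with_rotation() from above; the CSS selectors are assumptions about Amazon's current markup and will likely need adjusting.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # third-party: beautifulsoup4

MAX_PAGES = 400  # Amazon serves at most 400 pages per category or search query


def scrape_category(start_url: str) -> list[str]:
    """Walk a pool of catalog pages; the parsing logic is written once."""
    pool = deque([start_url])  # the pool, seeded with the start page
    pages_seen = 0
    product_links: list[str] = []
    while pool and pages_seen < MAX_PAGES:
        page_url = pool.popleft()
        html = fetch_with_rotation(page_url)  # retry/rotate sketch above
        pages_seen += 1
        soup = BeautifulSoup(html, "html.parser")
        # Collect product links on the current catalog page (selector assumed).
        for link in soup.select("a.a-link-normal.s-no-outline"):
            href = link.get("href")
            if href:
                product_links.append(urljoin(page_url, href))
        # Push the next paginator page back into the pool (selector assumed).
        next_link = soup.select_one("a.s-pagination-next")
        if next_link and next_link.get("href"):
            pool.append(urljoin(page_url, next_link["href"]))
    return product_links
```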
Mikhail Sisin, co-founder of the cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.