Data collector start guide
Search template: by domain
- You can search our ready-made collectors by the domain they collect from.
- You can also browse existing collectors by use case.
- For a domain-based search, input the domain here.
- Here you can see the results of your search, the collectors available for the domain or use case.
When the query is empty, popular collectors are shown.
Use additional features to get focused results
- Use this tab to order results by "Most Popular" or "Last Modified."
- Use "Filters" to get more specific results.
- Filter your results by:
- Web page - Collects all data from a website by URL.
- Search - Collects all data from a website via keyword.
- Discovery - Collects all data from a website by category.
- Bright Data - Public collectors created by Bright Data that are available for all customers.
- Customer - Collectors that customers developed with the IDE. These templates are monitored by the customer (admin).
Search template: by use-case
- In use-case view, you can browse our collector repository by the use case each collector is meant to solve, such as ‘Social networks’, ‘Online travel’ and so on.
Search template: domain doesn’t exist
- Can’t find a template for your need?
- You can develop a collector on your own using our in-house IDE, which comes with tools (such as methods and preview tools) to help you during development.
- You can fill out a short form describing your use case and data needs, and we’ll be in touch with you to develop the collector for you!
Ask for a custom collector
- Describe your data collection workflow process.
Please try to be as descriptive as you can.
- You can upload any files that can help us understand your needs. Good examples: a file containing a set of inputs you would like to provide the collector (URLs, search terms, list of hashtags, etc.), a screenshot of the page type you want to collect, with all the important data fields you want highlighted.
- What URLs would you like to collect data from?
Template chosen: Overview
- Here you can see the Collector's information:
- Ownership - "Bright Data" (collectors that are monitored by us and available for all customers) or "Customer" (developed by the customer in the IDE).
- Type - "Web page" (gets a URL of a specific page and collects data from this page alone); "Search" (collector that typically gets a keyword input and searches for it on the website, providing links and basic information for the search results); or "Discovery" (gets a URL of a category and returns URLs of all the pages on the website within that category).
- Use case - Brand industry (e.g. Online video-sharing).
- Last modification time.
- In this tab, you can see general information about the chosen template.
- In this tab, you can test the template in real-time and make sure you get all you need from it.
- Here you can find an example of the output this template returns - useful to see all the different data fields it collects.
- In this tab, you’ll find how to trigger this template using our API. To activate it, you need to finish the process.
- Here you can find the source code of this template. Once added to your collector list, you’ll also be able to edit it to better accommodate your needs.
- Here you can see a thorough description of the template.
Template chosen: Test collector
- In this tab, you can test the template in real time.
- In the testing phase, you will get a sample of the results for the input (mostly relevant for discovery collectors).
- Here you can give the input you would like to test (in this example, the URL of a publicly available Facebook profile).
- Click here to run a real-time test of the template on the input you provided.
- The test results will be presented here once the test has finished.
Template chosen: Output example
- Here you'll find an example of the output this template returns.
- You can view the result example of the template in:
- CSV format (table).
CSV flat format, where all fields contain textual data only (rather than complex objects like lists or JSON objects).
- JSON format.
- Note that for JSON view, only the first 5,000 characters will be displayed.
- Click here to download the result as a file.
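To make the difference between the nested view and the CSV flat format more concrete, here is a minimal sketch of how nested records could be flattened so that every field holds text only. It is an illustration, not the collector's actual export code: the function names and the choice to serialize lists and objects as JSON strings are assumptions.

```python
import csv
import io
import json


def flatten_record(record):
    """Return a copy of the record in which every value is plain text."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, dict)):
            # Assumption: nested values are stored as JSON strings.
            flat[key] = json.dumps(value, sort_keys=True)
        elif value is None:
            flat[key] = ""
        else:
            flat[key] = str(value)
    return flat


def to_flat_csv(records):
    """Render records as CSV text with one column per field name."""
    # Collect field names in first-seen order so the columns are stable.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Fields missing from a record are written as empty cells.
        writer.writerow(flatten_record(record))
    return buffer.getvalue()


sample = [
    {"name": "Ann", "tags": ["a", "b"]},
    {"name": "Bob", "followers": 10},
]
flat_csv_text = to_flat_csv(sample)
```

Each row of `flat_csv_text` can be opened directly in a spreadsheet, while the list in `tags` survives as a single text cell.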
Template chosen: API
- In this tab, you’ll be able to find the way to trigger this template using our API. In order to activate it you need to finish building your collector.
- Once you’ve added a collector based on this template, you’ll be able to trigger it using this API (providing the actual COLLECTOR_ID in the appropriate place).
- Note that you can’t trigger a collector before actually adding it.
Template chosen: Source code
- Here you'll find the source code of this collector. Once added to your collector list, you’ll also be able to edit it to better accommodate your needs.
- Here you see the source code for this collector.
- Click here to view the extended source code of this template in the IDE.
Note that you won’t be able to modify the code before adding a collector.
Collector delivery configurations
- Choose when to receive your data:
a. On a job completion - Send a bulk of requests and receive the results when all are ready.
b. Realtime - Trigger a request that will start running immediately.
- You can split the delivery into small batches (if your server can't handle a large number of results, or for any other reason) and receive each batch as soon as it's ready.
* Available only "on a job completion."
- You can set a specific name for the report.
- If you want, you can also get notifications about your collections when they finish, either by mail or webhook.
* Only available with "storage methods."
- Provide connection details for us to send the notifications.
- You can test and make sure we’re able to actually send the notifications to you.
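The batch-splitting option mentioned above can be pictured as cutting one large result set into fixed-size chunks. The sketch below shows only that idea; `split_into_batches` and the `batch_size` parameter are hypothetical names, and the real delivery mechanism may work differently.

```python
from typing import Iterator, List, Sequence


def split_into_batches(results: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of results, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(results), batch_size):
        yield list(results[start:start + batch_size])


results = [{"id": i} for i in range(5)]
batches = list(split_into_batches(results, batch_size=2))
batch_sizes = [len(batch) for batch in batches]
```

With five results and a batch size of two, the receiver handles three small deliveries instead of one large one.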
Collector output configurations
- On this page, you can change the way the collector will format each output field in the results.
- Here you see a list of all the data fields in this template’s results.
- Use this button to drag the fields and change the order in which they will be provided in the results.
- Use this button to ‘turn off’ unwanted fields, meaning you won’t get them in the results.
- This is the data point name in the report.
- If you define a field as 'Required' (mandatory), a record that doesn’t have this field is not delivered as a valid result and you are not charged for it. You will get this record with an error only.
- Timestamp - if enabled, results will include the actual time of collection.
- Input - if enabled, results will include the input you provided to the collector (recommended).
- Warnings - this option is always on, and will provide warnings in the results, if relevant.
- Error - this option is on, and will include system errors in the results (if any).
- Screenshots - if enabled, will send an additional file for each collected webpage with a screenshot of the collected page (doesn't work with the email delivery method).
- HTML - if enabled, will send an additional file for each collected webpage with the entire HTML of the collected page (doesn't work with the email delivery method).
- You can change the name of the field.
- This is the format of this data field (URL, number, text, image, etc).
- Check this box to have the collector validate that the result is of the expected format. If it is not, the record fails (meaning you don't have to pay for it).
- For certain URL types (like images), you can also receive the data itself, not only collect its URL.
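The 'Required' and format-validation options described above both decide whether a record counts as valid. The following sketch shows one way such checks could be expressed. The rule layout (`required`, `format`, `validate`) and the checker functions are assumptions made for illustration and do not reflect how the collector implements them.

```python
from urllib.parse import urlparse


def is_url(value):
    """Accept absolute http(s) URLs only."""
    parts = urlparse(str(value))
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_number(value):
    """Accept ints, floats, and strings that parse as a number."""
    if isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return True
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


# Hypothetical mapping from a format name to its check.
FORMAT_CHECKS = {
    "url": is_url,
    "number": is_number,
    "text": lambda value: isinstance(value, str),
}


def check_record(record, field_rules):
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    for name, rule in field_rules.items():
        value = record.get(name)
        if value is None or value == "":
            if rule.get("required"):
                errors.append(f"missing required field: {name}")
            continue  # Nothing to format-check on an absent field.
        if rule.get("validate") and not FORMAT_CHECKS[rule["format"]](value):
            errors.append(f"field {name} is not a valid {rule['format']}")
    return errors


rules = {
    "title": {"required": True, "format": "text", "validate": True},
    "price": {"required": False, "format": "number", "validate": True},
    "link": {"required": False, "format": "url", "validate": True},
}
good = check_record({"title": "A", "price": "12.5", "link": "https://example.com/x"}, rules)
bad = check_record({"price": "abc"}, rules)
```

In this sketch a record with a non-empty error list would be the one returned "with an error only".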
Initiate by API - Getting the data 'on a job completion'
- A regular request runs the request and returns the result. If it cannot run, it returns an error message.
A queue request adds the request to the end of the job queue. If the queue is empty, it starts running the request right away.
A replace request submits a new request and cancels the existing one.
- Replace this part with the input you’d like to run. Note that for better performance, you should send all your inputs in a single request so the collector can run multiple inputs in parallel.
- Choose which API token to use for running the collection.
Only people with this token will be able to run this collector.
- Once you successfully trigger your collector, you'll get the data-set for this request.
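As a rough illustration of the advice to send all inputs in a single request, the sketch below builds one POST request that carries a whole list of inputs. Everything specific here is an assumption: the URL is a placeholder (not the real endpoint), and the `collector` query parameter and the `Bearer` authorization header are guesses at a typical layout. Use the exact request shown in your collector's API tab instead.

```python
import json
import urllib.parse
import urllib.request

# Placeholder only; replace with the URL shown in the API tab.
TRIGGER_URL = "https://api.example.com/trigger"


def build_trigger_request(collector_id, api_token, inputs, url=TRIGGER_URL):
    """Build (but do not send) one request that carries every input."""
    query = urllib.parse.urlencode({"collector": collector_id})  # assumed parameter name
    body = json.dumps(list(inputs)).encode("utf-8")
    return urllib.request.Request(
        f"{url}?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


inputs = [
    {"url": "https://example.com/page-1"},
    {"url": "https://example.com/page-2"},
]
request = build_trigger_request("COLLECTOR_ID", "API_TOKEN", inputs)
```

Passing `request` to `urllib.request.urlopen` would send it. The point of the sketch is that both inputs travel in one body rather than in two separate calls.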
Initiate by API - Getting the data in 'real time'
- First, trigger a real-time collection job for the input you provide. You'll immediately receive a response with the created job's ID.
- After receiving a response, replace 'Response_ID' with the job ID from the triggering API call and receive the result of the data collection.
- Provide the input you would like the collector to use.
- You can add more inputs to the same run.
- Alternatively, you can upload a CSV file that contains all the input lines you would like to run.
- Here you can download a template for the CSV, which you can fill with the inputs and upload to the collector.
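The two-step real-time flow at the start of this section (trigger, then fetch by job ID) can be sketched as follows, with inputs read from CSV text. The `trigger` and `fetch` callables are hypothetical stand-ins for the two API calls, and the polling loop assumes the result may not be ready on the first fetch. Neither detail is confirmed by this guide.

```python
import csv
import io
import time


def read_inputs(csv_text):
    """Turn CSV text (a header row plus one input per line) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def collect_realtime(trigger, fetch, inputs, poll_seconds=1.0, max_polls=30):
    """Trigger a job, then fetch its result by job ID until it is ready."""
    job_id = trigger(inputs)  # Step 1: the response carries the job ID.
    for _ in range(max_polls):
        result = fetch(job_id)  # Step 2: ask for the result of that job.
        if result is not None:
            return result
        time.sleep(poll_seconds)  # Assumption: None means "not ready yet".
    raise TimeoutError(f"job {job_id} was not ready after {max_polls} polls")


# In-memory stand-ins so the sketch runs without any network access.
poll_count = {"value": 0}


def fake_trigger(inputs):
    return "job-1"


def fake_fetch(job_id):
    poll_count["value"] += 1
    if poll_count["value"] < 3:
        return None
    return [{"job": job_id, "rows": 2}]


inputs = read_inputs("url\nhttps://example.com/a\nhttps://example.com/b\n")
result = collect_realtime(fake_trigger, fake_fetch, inputs, poll_seconds=0)
```

In real use, `trigger` and `fetch` would wrap the two API calls shown in the collector's API tab.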
Scheduling a collector
- Enter when you’d like your scheduled collector to run for the first time.
Specify if you'd like the scheduler to:
- Run forever.
- Only repeat the scheduled collection a certain number of times.
- End the collection at a certain time of your choice.
- Provide the scheduled collection interval (e.g. start on 01-12-2020, at 12:00AM, forever, every one hour).
- Choose which days of the week the scheduler should include (in this example, all but Wed, Fri).
- Choose at what time within a day the scheduler should start to run, and at what time within a day it should stop.
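To see how the scheduling options above combine, here is a small sketch that computes run times from a start time, an interval, an end condition, allowed weekdays, and a daily time window. It models the settings only and is not the scheduler's actual logic. The function and parameter names are hypothetical, and the example reads "01-12-2020" as 1 December 2020, which is an assumption.

```python
from datetime import datetime, time, timedelta
from itertools import islice

ALL_DAYS = frozenset(range(7))  # Monday=0 ... Sunday=6, as in datetime.weekday()


def scheduled_runs(start, interval, max_runs=None, end=None,
                   weekdays=ALL_DAYS, window_start=time.min, window_end=time.max):
    """Yield run times; with max_runs=None and end=None it runs forever."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if not weekdays:
        raise ValueError("at least one weekday must be allowed")
    produced = 0
    current = start
    # Caveat: with no end and a window that no tick falls into,
    # this loop keeps searching without ever yielding.
    while (end is None or current <= end) and (max_runs is None or produced < max_runs):
        in_day = current.weekday() in weekdays
        in_window = window_start <= current.time() <= window_end
        if in_day and in_window:
            yield current
            produced += 1
        current += interval


# Start on 1 December 2020 at 12:00AM, forever, every hour, all days but Wed and Fri.
start = datetime(2020, 12, 1, 0, 0)
no_wed_fri = ALL_DAYS - {2, 4}
first_runs = list(islice(scheduled_runs(start, timedelta(hours=1), weekdays=no_wed_fri), 25))
```

Because 1 December 2020 was a Tuesday, the first 24 runs fall on that day, and the 25th run skips Wednesday and lands on Thursday at 12:00AM.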
Collectors list main page
- This is the full view of your collectors list. Every collector you create will be added here.
- Use the search bar to search for specific collectors.
- Use the date picker to show collector details and info for a specific date or date range.
- In each of these cards, you can find all of the relevant information for your collectors.
- Click on this icon to open the collector actions menu.
- Click on this icon to open an expanded view of this collector to see more details.
- Click here to add a new collector.
Collector actions menu
- Each collector has an actions menu.
Through this menu you can:
- Run the collector.
- Edit the collector code in the IDE.
- View the collector statistics.
- Delete/disable the collector.
Collector card expanded view
- This is the expanded view of the collector.
- View the collector's statistics. You can go to the main statistics page by clicking on the top right icon.
- View the collector's properties.
- View the collector's delivery preferences. You can edit them by clicking on the top right icon.
- View the collector's output fields. You can edit them by clicking on the top right icon.
- This page provides an overview of each collector’s run statistics.
- This icon indicates that this specific job is currently running.
- If a number appears next to the icon, this indicates that additional attempts at completing the run are in progress. When the run is complete, the number will disappear.
TIP: Wait for the full completion of a run before analyzing your stats.
- This icon will show a report with your collected data.
- This icon allows you to download the results to your computer.
- This icon re-runs your request (in case it failed the first time).
- This icon shows you the complete run performance log (including success, fail rates, and errors).
- This icon appears only for failed jobs.
Clicking on this icon will take you to the IDE to view the error(s) that occurred in this run.