Entity Extraction in Bulk using TextRazor’s API & Screaming Frog’s Custom JavaScript snippets

Since version 20, Screaming Frog’s SEO Spider has been able to run custom JavaScript snippets to either extract data or perform an action.

It ships with a couple of snippets already, but I’ll share a custom one with you too.

Specifically, an extraction snippet that utilises TextRazor to extract entities in bulk with a tool you probably already use. Why, you ask? Well, if you really have to ask, please read the ‘Definitive Guide on Entity SEO’.

For those who don’t ask, let’s start.

Getting TextRazor’s API key

First things first, you need an API key from TextRazor, a text analysis tool. The good news is that it’s free, though capped at 500 requests per day. Getting the key is as easy as going to the signup page and creating an account. Once you’ve done that, log in to see your API key.
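
If you want to sanity-check your key before wiring it into Screaming Frog, a quick fetch from the browser console (or Node 18+) will do. This is just a minimal sketch, using the same endpoint, header and parameters as the full snippet below:

// Minimal sketch to verify a TextRazor API key
// Same endpoint and 'x-textrazor-key' header as the full snippet below
fetch('https://api.textrazor.com/', {
    method: 'POST',
    headers: {
        'x-textrazor-key': 'YOUR_API_KEY', // Replace with your actual key
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    body: new URLSearchParams({
        text: 'London is the capital of England.',
        extractors: 'entities'
    }).toString()
})
    .then(response => response.json())
    .then(data => console.log(data)) // A valid key returns entities; an invalid key returns an error object
    .catch(console.error);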

When you finish this step, it’s time for the fun stuff.

Setting up Screaming Frog

Custom JavaScript snippets are a new feature introduced in v20. If you don’t have it yet, now’s the time to update that puppy.

The snippet & how to add it

// TextRazor Entity Extraction
//
// IMPORTANT:
// You will need to supply your API key below which will be stored
// as part of your SEO Spider configuration in plain text. 
// You can set 'languageOverride' if you want, but overall TextRazor
// is doing OK in identifying the language.

// API key and language override variables (keep your API key static in Screaming Frog)
const TEXTRAZOR_API_KEY = '';
const languageOverride = ''; // Leave blank if you don't want to override. Check https://www.textrazor.com/languages for supported languages

const userContent = document.body.innerText; // This captures the text content of the page

let requestCounter = 0; // Initialize request counter
const maxRequestsPerDay = 500; // Set the maximum requests per day; the free plan allows 500 requests per day

// The free plan has a limit of 2 concurrent requests; the delay below handles this
function delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

async function extractEntitiesWithDelay() {
    const entities = [];
    const chunkSize = 5000;
    const textChunks = [];

    for (let i = 0; i < userContent.length; i += chunkSize) {
        textChunks.push(userContent.substring(i, i + chunkSize));
    }

    for (let i = 0; i < textChunks.length; i++) {
        if (requestCounter >= maxRequestsPerDay) {
            console.log('Reached the maximum number of requests for the day.');
            break;
        }

        const text = textChunks[i];
        console.log('Sending text chunk to TextRazor:', text.slice(0, 200)); 

        const bodyParams = new URLSearchParams({
            text: text,
            extractors: 'entities,topics',
        });

        // Conditionally add the language override if it's provided
        if (languageOverride) {
            bodyParams.append('languageOverride', languageOverride);
        }

        const response = await fetch('https://api.textrazor.com/', {
            method: 'POST',
            headers: {
                'x-textrazor-key': TEXTRAZOR_API_KEY, 
                'Content-Type': 'application/x-www-form-urlencoded'
            },
            body: bodyParams.toString()
        });

        requestCounter++; // Count this request against the daily quota regardless of outcome

        if (response.ok) {
            const data = await response.json();
            console.log('TextRazor response:', data); // Log the response for debugging

            if (data.response && data.response.entities) {
                entities.push(...data.response.entities);
            }
        } else {
            const errorText = await response.text();
            console.error('TextRazor API error:', errorText);
        }

        if (i < textChunks.length - 1) {
            await delay(1000);
        }
    }

    return entities;
}

// The first version returned a lot of invalid entities; this filters out most of them
function isValidEntity(entity) {
    const invalidTypes = ["Number", "Cookie", "Email", "Date"];
    const entityId = entity.entityId || entity.matchedText;

    if (entity.type && Array.isArray(entity.type) && entity.type.length > 0) {
        if (invalidTypes.includes(entity.type[0]) || /^[0-9]+$/.test(entityId)) {
            return false;
        }
    } else if (/^[0-9]+$/.test(entityId)) {
        return false;
    }

    return true;
}

function processEntities(entities) {
    const entitiesDict = {};

    entities.forEach(entity => {
        if (isValidEntity(entity)) {
            const entityId = entity.entityId || entity.matchedText;
            const entityName = entity.matchedText.toLowerCase(); // Convert entity name to lowercase
            const freebaseLink = entity.freebaseId ? `https://www.google.com/search?kgmid=${entity.freebaseId}` : '';
            const wikiLink = entity.wikiLink || ''; // Ensure we're capturing the Wikipedia link correctly

            if (entityId !== 'None' && isNaN(entityName)) {  // Filter out numeric-only entities
                const key = entityName + freebaseLink; // Unique key based on name and link
                if (!entitiesDict[key]) {
                    entitiesDict[key] = {
                        entity: entityName,
                        count: 1,
                        freebaseLink: freebaseLink,
                        wikiLink: wikiLink
                    };
                } else {
                    entitiesDict[key].count += 1;
                }
            }
        }
    });

    const result = Object.values(entitiesDict).filter(item => item.entity && item.entity !== 'None'); // Filter out empty or 'None' entities

    return JSON.stringify(result);
}

return extractEntitiesWithDelay()
    .then(entities => {
        if (entities.length === 0) {
            console.warn('No entities found in the response.');
        }
        return seoSpider.data(processEntities(entities));
    })
    .catch(error => seoSpider.error(error));

To add the snippet, open the Custom JavaScript configuration screen, found under Configuration > Custom.

This will open a new screen with a couple of options: you can click ‘Add from Library’ or just ‘Add’. The library is interesting: it holds the embedded system library that ships with a couple of snippets already, and the user library where your saved snippets go.

You also have the option to import/export snippets via a JSON file.

For now, just click ‘Add’. This opens a new screen with an editor where you paste the snippet. This is also where you add your API key to the config. After that, you can test the snippet directly on the right side by entering a URL.

Testing that puppy

For the snippet to work, you need your API key set up. But that’s not all: a very important step is to go to your crawl configuration and set rendering to JavaScript. If you don’t, the snippets won’t execute.

As mentioned before, the free version of the TextRazor API is capped at 500 requests per day, so the snippet is capped at that too. Crawling can also be slow, since the API only allows two concurrent requests, which is why I added a delay.

When crawling, you get an extra column with the result in JSON format. It includes:

  • Entity name
  • Count of the entities
  • Freebase link (if included)
  • Wikipedia link (if included)
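
Based on the processEntities function in the snippet, that column holds an array of objects. A single cell will look roughly like this (the values here are made up for illustration):

[
  {
    "entity": "screaming frog",
    "count": 4,
    "freebaseLink": "https://www.google.com/search?kgmid=/m/0abc123",
    "wikiLink": "http://en.wikipedia.org/wiki/Screaming_Frog"
  },
  {
    "entity": "textrazor",
    "count": 2,
    "freebaseLink": "",
    "wikiLink": ""
  }
]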

Next steps

Are up to you. You can export to Google Sheets, CSV, etc. If you want an easy formatter for Google Sheets, just leave a reply and I’ll see what I can share.
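
In the meantime, here’s a minimal sketch of what such a formatter could look like as a Google Apps Script custom function (the name FLATTEN_ENTITIES and the column layout are just illustrative):

// Paste into Extensions > Apps Script, then use =FLATTEN_ENTITIES(A2) in a cell
// Flattens one JSON cell from the crawl export into rows of entity data
function FLATTEN_ENTITIES(json) {
  const header = [['Entity', 'Count', 'Freebase link', 'Wikipedia link']];
  const rows = JSON.parse(json).map(e => [e.entity, e.count, e.freebaseLink, e.wikiLink]);
  return header.concat(rows); // Custom functions can return a 2D array that spills into the sheet
}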


{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}
>