Scanning for broken Wix links

What do you do when a WordPress to Wix migration leaves you with 700 posts sprinkled with broken links? (Part 1: Find them)

Scanning for broken Wix links
An example page with a 200 response code.

I'm working this week to help a volunteer client (er.. I'm volunteering, not them) with their website. They had someone do a WordPress to Wix migration, and while their new site looks beautiful, when you dig in a few pages deep, there's still work to do.

My client had 10+ years of blog posts that their previous volunteer imported. The imports pretty much look fine, but many of those blog posts referenced other blog posts, with a mixture of URL formats, including the archive format from WP like domain.com/2012/02/04/pageslughere. Those are now mostly at domain.com/pageslughere. Except that sometimes pageslughere is not quite the same due to getting truncated differently (Wix likes 96 characters?), or having an apostrophe in the title that was handled differently, or... well, you get the idea. It's a bit of a mess, and there are 700+ blog post pages. So we needed some automation to find them, and then we'll need more to fix them.

How to find them:

1) You need Node & pm installed. I'm running everything on my local machine which I use for development anyway, so I didn't need to do anything extra here.

2) You need this broken link checker. I installed it in an empty directory with

npm install broken-link-checker

3) Make a new javascript file in that directory. Here's mine:

var { SiteChecker } = require("broken-link-checker");

const siteChecker = new SiteChecker(
    { 
        excludeInternalLinks: false,
        excludeExternalLinks: false, 
        filterLevel: 3,
        acceptedSchemes: ["http", "https"],
        // Add any sites that are giving false positives false positives. 
        excludedKeywords: ["linkedin"]
    },
    {
        "error": (error) => {
            console.error(error);
        },
        "html": (tree, robots, response, pageURL, customData) => {
            // Remove this line below if you don't want pages logged as they're scanned
            console.log('Scanning page:', pageURL)
            // This mess below is specific to the structure of the Wix page I was working with. I /think/ it's generalizable. I hope so, because it's an hour of my life I'm not getting back.
            try{ let myNode = tree.childNodes[1].childNodes[2].childNodes[17].childNodes[0].childNodes[2].childNodes[0].childNodes[1].childNodes[0].childNodes[0].childNodes[0].childNodes[1].childNodes[0].childNodes[0].childNodes[0].childNodes[0].childNodes[1].childNodes[0].childNodes[0].childNodes[0].childNodes[0].childNodes[0].childNodes[0].childNodes[3].childNodes[1]

            if (myNode.attrs[1].value == 'not-found-page'){
                // What specifically to log if you get a "wix friendly error"
                console.log("WIX_PAGE_NOT_FOUND => ", pageURL)}
            }
            catch {
                //console.log('not a blog page')
            }
        },
        "link": (result, customData) => {
            if(result.broken) {
                if(result.http.response && ![undefined, 200].includes(result.http.response.statusCode)) {
          // What to log if you get a non-200 status code (mostly 404s)
                    console.log(`${result.http.response.statusCode} => ${result.url.original}`);
                }
            }
        },
        "end": () => {
            // I find that sometimes I don't get this final log and the code just stops.  Not sure why at this point.
            console.log("COMPLETED!");
        }
    }
);
// be sure to update code below for your actual domain.
siteChecker.enqueue("https://domain.to.scan/");

4) And then I run it like this:

node thatjsfileabove.js > outputfile.txt

That writes turns those console.logs into an outputfile.

Here's documentation for the broken-link-checker:

GitHub - stevenvachon/broken-link-checker: Find broken links, missing images, etc within your HTML.
Find broken links, missing images, etc within your HTML. - GitHub - stevenvachon/broken-link-checker: Find broken links, missing images, etc within your HTML.

I'm also deeply grateful to the author of the page below, which provided the code and options for a 'typical' scanning case, that got me closer to my special Wix case:

Scan for Broken Links on a Website with JavaScript
Learn how to check for broken or unreachable links on a website using simple JavaScript.

If you try it, please let me know if it works for you! I only have one big Wix site to test on, so I'm not sure how generalizable my solution is.