Reputation: 2085

How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file

I have an HTML file with about 150 anchor tags. I need only the links from these tags. For example, from <a href="http://www.google.com"></a> I want to get only the http://www.google.com part.

When I run a grep,

cat website.htm | grep -E '<a href=".*">' > links.txt

this returns the entire line that matched, not just the link I want, so I tried using a cut command:

cat drawspace.txt | grep -E '<a href=".*">' | cut -d’”’ --output-delimiter=$'\n' > links.txt

Except that it is wrong: it doesn't work and gives me an error about wrong parameters... So I assume that the file was supposed to be passed along too. Maybe like cut -d’”’ --output-delimiter=$'\n' grepedText.txt > links.txt.
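
A likely cause of the parameter error, other than the missing file name: cut needs a field list (-f), and its delimiter has to be a single plain character rather than typographic quotes. Below is a minimal sketch, assuming each link line looks like <a href="URL">text</a> with href as the first quoted attribute. The name links_with_cut is a hypothetical helper, not a standard command:

```shell
#!/bin/sh
# Sketch: keep only lines containing an anchor, then split on the
# double-quote character and print field 2 (the href value).
# Assumption: href is the first double-quoted attribute on the line.
links_with_cut() {
    grep '<a href="' "$@" | cut -d '"' -f 2
}

# Example input taken from the question text.
printf '%s\n' '<a href="http://www.google.com"></a>' | links_with_cut
```

With no arguments the helper reads standard input; with a single file name it reads that file.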

But I wanted to do this in one command if possible... So I tried doing an AWK command.

cat drawspace.txt | grep '<a href=".*">' | awk '{print $2}’

But this wouldn't run either. It was asking me for more input, because I wasn't finished....
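
A shell usually prompts for more input when a quote is left open, and the awk command above ends with a typographic ’ rather than a plain '. Below is a minimal sketch of the same idea with plain quotes, splitting on the double-quote character instead of whitespace. It assumes href is the first quoted attribute on the line, and links_with_awk is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: use " as the awk field separator, so that for a line such as
#   <a href="URL">text</a>
# the URL is field 2. Lines without an anchor are skipped.
links_with_awk() {
    awk -F'"' '/<a href="/ { print $2 }' "$@"
}

# One line from the sample HTML in the question.
printf '%s\n' '      <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>' |
    links_with_awk
```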

I tried writing a batch file, and it told me FINDSTR is not an internal or external command... So I assume my environment variables were messed up and rather than fix that I tried installing grep on Windows, but that gave me the same error....

The question is, what is the right way to strip out the HTTP links from HTML? With that I will make it work for my situation.

P.S. I've read so many links/Stack Overflow posts that showing my references would take too long.... If example HTML is needed to show the complexity of the process then I will add it.

I also have a Mac and a PC, and I switch back and forth between them to use their shell/batch/grep/terminal commands, so a solution for either one will help me.

I also want to point out that I'm in the correct directory.

HTML:

<tr valign="top">
    <td class="beginner">
      B03&nbsp;&nbsp;
    </td>
    <td>
        <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
</tr>

<tr valign="top">
  <td class="beginner">
    B04&nbsp;&nbsp;
  </td>
  <td>
      <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
      B05&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
        B06&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>

Expected output:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.
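
One minimal sketch that produces this kind of output, assuming every link is a double-quoted href inside an <a> tag that does not span lines (a regular expression is not a full HTML parser). The helper name extract_links and the temporary sample file are illustrative only:

```shell
#!/bin/sh
# Sketch: print the value of each double-quoted href in an <a> tag, one per line.
extract_links() {
    # grep -o prints only the matched text; [^"]* stops at the closing quote,
    # unlike a greedy .* which runs to the last quote on the line.
    grep -Eo '<a [^>]*href="[^"]*"' "$1" |
        sed -E 's/.*href="([^"]*)"/\1/'
}

# Small sample taken from the HTML above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td>
    <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
<td>
    <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
EOF

extract_links "$sample"
rm -f "$sample"
```

Redirecting the call, for example extract_links website.htm > links.txt, would store the links in a text file.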

Upvotes: 30

Answers (8)

I. Marin

Reputation: 1

Here's a dash script (for Linux) that compares the URLs found in the first parameter (a file or folder) against the URLs found in the remaining parameters (files and/or folders). Call the script with the --help flag to see how to use it:

#!/bin/dash
## Supported shells: dash, bash, zsh, ksh

GetOSType () {
    #$1 = returns the type of the current operating system:
    
    case "$(uname -s)" in
    *"Linux"* )
        eval $1="Linux"
        ;;
    *"Darwin"* | *"BSD"* )
        eval $1="BSD-based"
        ;;
    * )
        eval $1="Other"
        ;;
    esac
}

StoreURLsWithLineNumbers () {
    #Adds a line number before each found URL, on a separate line:
    
    count_all="0"
    mask="00000000000000000000"
    
    #For <file group 1>: initialise next variables:
    file_group="1"
    count=0
    
    commandN_text=""
    if [ ! "$commandN_flag" = "0" ]; then
        commandN_text="Step $commandN_flag - "
    fi
    
    for line in $(PrintURLs file_parameters_1 1 1; printf '\n%s\n' "### Separator ###";    for i in $(GenerateSequence 2 $file_parameters_0); do eval current_param="\$file_parameters_$i"; PrintURLs current_param $i 2; done;); do
        if [ "$line" = "### Separator ###" ]; then
            eval lines$file_group\1\_0=$count
            eval lines$file_group\2\_0=$count
            
            #For <file group 2>: initialise next variables:
            file_group="2";
            count="0"
            continue;
        fi
        
        printf "\033]0;%s\007" "Storing URLs into memory [$commandN_text""group $file_group]: $(($count + 1))...">"$print_to_screen"
        count_all_prev=$count_all
        count_all=$(($count_all + 1))
        count=$(($count + 1))
        if [ "${#count_all_prev}" -lt "${#count_all}" ]; then
            mask="${mask%?}"
        fi
        number="$mask$count_all"
        
        eval lines$file_group\1\_$count=\"\$number\"
        eval lines$file_group\2\_$count=\"\$line\" #URL
    done;
    eval lines$file_group\1\_0=$count
    eval lines$file_group\2\_0=$count
}

PrintURLs () {
    #Prints the found URLs:
    #$1 = input variable name containing the search path
    #$2 = input number representing the current parameter position (number)
    #$3 = input number representing the current file group
    
    if [ "$domains_flag" = "1" ]; then
        extract_urls_command="$get_domains_command"
    fi
    {
        eval path_to_search="\"\$$1\""
        current_file_number="$2"
        eval stored_file=\"\$stored_file_parameters_$current_file_number\"
        current_file_group="$3"
        
        eval current_mime_type=\"\$mime_types_$current_file_number\"
        
        current_count="$current_file_number"
        if [ "$current_file_group" = "2" ]; then
            current_count=$(($current_count - 1))
        fi
        printf "\033]0;%s\007" "Loading files from group [$current_file_group] - param. $current_count...">"$print_to_screen"
        if [ "$current_mime_type" = "directory" ]; then
            cd "$path_to_search"
            
            if [ ! "$skip_non_text_files_flag" = "1" ]; then
                current_path='.'
                Find_Print_DOCX; Find_Print_XLSX; Find_Print_PPTX_PPSX; Find_Print_OPEN_DOC; Find_Print_PDF; Find_Print_TEXT_FILES_IN_ARCHIVE;
            fi
            
            eval find "\$path_to_search $find_negation -type d $find_negation -path '.' -a \( "$find_parameters" \)"|{
                while IFS= read -r current_path; do
                    GetMimeType current_path current_mime_type
                    if [ ! "$current_mime_type" = "binary" ] && [ ! "$current_mime_type" = "undetermined" ]; then
                        Find_Print_TEXT_FILES;
                    fi
                done
            }
        else
            current_path="$path_to_search"
            Find_Print_DOCX; Find_Print_XLSX; Find_Print_PPTX_PPSX; Find_Print_OPEN_DOC; Find_Print_PDF; Find_Print_TEXT_FILES_IN_ARCHIVE;
            Find_Print_TEXT_FILES;
        fi
        printf "\033]0;%s\007" "Extracting URLs from group [$current_file_group]...">"$print_to_screen"
    } 2>/dev/null|eval "$extract_urls_command"
}

GFC () { # Generate Find Command #
    #Function for running a find command containing the <cmd> to execute by find ($1 variable content), the <file extension(s) to match> ($2 .. $N) and the <find parameters> to match:
    
    eval cmd="\"\$$1\""
    while [ -n "$2" ]; do
        eval find "\$current_path $find_negation -type d $find_negation -path '.' -a \( -name '$2' \) -a \( $find_parameters \) $cmd \;"
        shift
    done
}

Find_Print_DOCX () {
    cmd=" -exec unzip -q -c '{}' 'word/_rels/*'"
    GFC cmd '*.docx'
}

Find_Print_XLSX () {
    cmd=" -exec unzip -q -c '{}' 'xl/worksheets/_rels/*'"; GFC cmd '*.xlsx'
}

Find_Print_PPTX_PPSX () {
    cmd=" -exec unzip -q -c '{}' 'ppt/slides/_rels/*'"; GFC cmd '*.pptx' '*.ppsx'
}

Find_Print_OPEN_DOC () {
    cmd=" -exec unzip -q -c '{}' 'content.xml'"; GFC cmd '*.odt' '*.ods' '*.odp'
}

Find_Print_PDF () {
    cmd=" -exec pdftohtml -i '{}' -stdout"; GFC cmd '*.pdf'
}

Find_Print_TEXT_FILES_IN_ARCHIVE () {
    cmd=" -exec unzip -q -c '{}'"; GFC cmd '*.zip'
    cmd=" -exec 7z e '{}'"; GFC cmd '*.7z'
    cmd=" -exec unrar p '{}'"; GFC cmd '*.rar'
    cmd=" -exec bzip2 -dc '{}'"; GFC cmd '*.bz2'
    cmd=" -exec xz -dc '{}'"; GFC cmd '*.xz'
    cmd=" -exec gzip -dc '{}'"; GFC cmd '*.gz'
    cmd=" -exec tar -xOf '{}'"; GFC cmd '*.tar' ## ERROR: NO OUTPUT GENERATED (LINUX)
    cmd=" -exec tar -xOzf '{}'"; GFC cmd '*.tgz' ## ERROR: NO OUTPUT GENERATED (LINUX)
    cmd=" -exec tar -xOjf '{}'"; GFC cmd '*.tar.bz2'
    cmd=" -exec tar -xOJf '{}'"; GFC cmd '*.tar.xz'
    cmd=" -exec tar -xOzf '{}'"; GFC cmd '*.tar.gz'
}

Find_Print_TEXT_FILES () {
    if [ "$current_mime_type" = "text" ]; then
        cmd=" -exec cat '{}'"; GFC cmd '*'
    elif [ "$current_mime_type" = "device" ]; then
        printf '%s' "$stored_file"
    fi
}

GetMimeType () {
    #$1 = input variable name containing the file path of the file to analyze
    #$2 = output variable containing the mime type of the file having the path stored in the $1 input variable
    
    eval current_file_path=\"\$$1\"
    
    if [ ! "${current_file_path#"/dev/"*}" = "${current_file_path}" ]; then
        result="device"
    else
        if [ -d "$current_file_path" ]; then
            result="directory"
        else
            current_file_mime_type="$(file -bL --mime-type "$current_file_path" 2>/dev/null)"
            case "$current_file_mime_type" in
                *"application/x-"* )
                    result="binary"
                ;;
                *"text/"* )
                    result="text"
                ;;
                *"inode/symlink"* )
                    result="directory"
                ;;
                *"application/pdf"* )
                    result="PDF"
                ;;
                * )
                    result="undetermined"
                ;;
            esac
        fi
    fi
    eval $2=\"\$result\"
}

GetMimeTypes () {
    #Generates the <mime_types> array for each file having the file path in the <file_parameters> array:
    
    for j in $(GenerateSequence 1 $file_parameters_0); do
        eval current_file_path="\"\$file_parameters_$j\""
        GetMimeType current_file_path mime_types_$j
    done
    mime_types_0="$file_parameters_0"

}

GenerateSequence () {
    #Prints the sequence of numbers $1 .. $2, but only if $1 <= $2:
    
    sequence_start=$(($1))
    sequence_end=$(($2))
    if [ "$sequence_start" -le "$sequence_end" ]; then
        seq $sequence_start $sequence_end
    fi
}

ExtractFirstAndLastPathComponent () {
    #$1 = input path
    #$2 = returns the first path component
    #$3 = returns the last path component
    
    eval current_path="\"\$$1\""
    
    first_path_component=""
    last_path_component=""
    
    if [ -n "$current_path" ]; then
        #Remove trailing '/' characters:
        while [ ! "${current_path%"/"}" = "$current_path" ]; do
            current_path="${current_path%"/"}"
        done
        
        if [ -z "$current_path" ]; then
            eval current_path=\"\$$1\"
        fi
        
        last_path_component="${current_path##*"/"}"
        first_path_component="${current_path%"$last_path_component"}"
    fi
    
    eval $2="\"\$first_path_component\""
    eval $3="\"\$last_path_component\""
}

GetCurrentShell () {
    #$1 = returns the <current shell name>
    #$2 = returns the <current shell full path>
    
    if [ -n "$BASH_VERSION" ]; then current_shell_name="bash";
    elif [ -n "$ZSH_VERSION" ]; then current_shell_name="zsh";
    elif [ -n "$KSH_VERSION" ]; then current_shell_name="ksh";
    elif [ "$PS1" = '$ ' ]; then current_shell_name="dash";
    else current_shell_name="dash"; #default shell
    fi
    current_shell_full_path="$(which "$current_shell_name")"
    
    eval $1=\"\$current_shell_name\"
    eval $2=\"\$current_shell_full_path\"
}

PrintArrayElements () {
    #Prints the elements of the array named by $1:
    
    eval pae_array_count="\"\$$1_0\""
    for i in $(GenerateSequence 1 $pae_array_count); do
        eval current_param="\"\$$1_$i\""
        printf '%s\n' "$current_param"
    done
    if [ "$pae_array_count" = "0" ]; then printf '%s\n' "<none>"; fi
    printf "\n"
}

PrintErrorExtra () {
    #Prints the <command path>, and: <flag>, <file paths> and <find> parameters:
    
    {
    
    printf '%s\n' "Command path:"
    printf '%s\n' "$current_shell_full_path '$current_script_path'"
    
    printf "\n"
    
    #Flag parameters are printed non-quoted:
    printf '%s\n' "Flags:"
    PrintArrayElements flag_parameters
    
    #Path parameters are printed quoted with '':
    printf '%s\n' "Paths:"
    PrintArrayElements file_parameters

    #Find parameters are printed quoted with '':
    printf '%s\n' "'find' parameters:"
    PrintArrayElements find_parameters
    
    }>&2
}

PrintErrorMessage () {
    #Prints the string parameters received by this function as an error message (to stderr):
    
    printf '\n%s\n' "${@}">&2
}

PrintErrorMessageAndSetError () {
    #Prints the string parameters received by this function as an error message (to stderr) and sets the <error> variable to "true":
    
    printf '\n%s\n' "${@}">&2
    error="true"
}

PrintWarningMessage () {
    #If the <ignore warnings flag> is "0" (unset): prints the string parameters received by this function as a warning message (to stderr):
    
    if [ "$ignore_warnings_flag" = "0" ]; then
        printf '\n%s\n' "${@}">&2
    fi
}

CheckUtilities () {
    #Check if any of the necessary utilities (received as string parameters by this function) is missing:
    
    msg_type="warning"; msg_prefix="WARNING"
    warning="false"
    if [ "$1" = '--warning' ]; then
        shift
    elif [ "$1" = '--warning-message' ]; then
        eval message="\"\$$2\""
        shift; shift
    else
        msg_type="error"; msg_prefix="ERROR"
        error="false"
    fi
    
    for utility; do
        which $utility >/dev/null 2>/dev/null || {
            message="$msg_prefix: the '$utility' utility is not installed!"
            if [ "$msg_type" = "warning" ]; then
                PrintWarningMessage "$message"
            elif [ "$msg_type" = "error" ]; then
                PrintErrorMessage "$message"
            fi
            eval $msg_type="\"true\""
            eval $msg_type\_all="\"true\""
        }
    done
    
    if [ "$error" = "true" ]; then
        printf "\n"
        CleanUp && exit 1
    fi>&2
}

trap1 () {
    #Calls the 'CleanUp' function after pressing "CTRL + C / CTRL + Z" and ends all the processes started by this script:
    
    CleanUp
    #if not running in a subshell: print "Aborted"
    if [ "$commandN_flag" = "0" ]; then
        printf "\n""Aborted.""\n">"$print_to_screen"
    fi
    
    #kill all children processes, suppressing "Terminated" message:
    kill -s PIPE -- -$$
    
    exit
}

CleanUp () {
    
    #Restore "INTERRUPT" (CTRL + C) and "TERMINAL STOP" (CTRL + Z) signals:
    trap - INT
    trap - TSTP
    
    #Clear the title:
    printf "\033]0;%s\007" "">"$print_to_screen"
    
    #Restore initial IFS:
    #IFS="$initial_IFS"
    unset IFS
    
    #Restore initial directory:
    cd "$initial_dir"
    
    DestroyArray flag_parameters file_parameters stored_file_parameters find_parameters lines11 lines12 lines21 lines22
}

DestroyArray () {
    #Frees memory occupied by the arrays $1 .. $N:
    
    while [ -n "$1" ]; do
        eval eval array_length=\'\$$1\_0\'
        if [ -z "$array_length" ]; then array_length=0; fi
        for i in $(GenerateSequence 1 $array_length); do
            eval unset $1\_$i
        done
        eval unset $1\_0
        shift
    done
}

ProcessGroup () {
    eval lines_0=\$lines$1\1_0
    for i in $(GenerateSequence 1 $lines_0); do
        printf "\033]0;%s\007" "Processing group [$1] - URL: $i...">"$print_to_screen"
        eval printf \"\%s\\n\" \"\$angle_bracket_$1 \$lines$1\1_$i \$lines$1\2_$i\"
    done|{
        if [ "$preserve_order_flag" = "0" ]; then
            sort -k 3|uniq -c -f 2
        else
            uniq -c -f 1
        fi
    }
}

SortAndFilter1 () {
    if [ "$preserve_order_flag" = "0" ]; then
        sort -k 4|SortAndFilter2 "3"
    else
        SortAndFilter2 "2"
    fi
}

SortAndFilter2 () {
    field_to_be_sorted=$1
    if [ "$different_flag" = "1" ]; then
        if [ "$preserve_order_flag" = "0" ]; then
            uniq -u -f 3|sort -k $field_to_be_sorted|eval "$prepare_for_output_command"
        else
            eval "$prepare_for_output_command"
        fi
    elif [ "$common_flag" = "1" ]; then
        if [ "$preserve_order_flag" = "0" ]; then
            uniq -d -f 3|sort -k $field_to_be_sorted|eval "$prepare_for_output_command"|eval "$remove_angle_brackets_command"
        else
            eval "$prepare_for_output_command"|eval "$remove_angle_brackets_command"
        fi
    fi
}

DisplayHelp () {
    cat<<'EOF'

 - UNIQue Links - a dash shell script to compare URLs ( examples of considered URLs here: ...://... OR x.y... ): in [the first (file/folder) parameter [= group 1]] compared to [the other (files and/or folders) parameters [= group 2]]
     
     Usage:
         
         dash '/path/to/this/script.sh' <flags> '/path/to/file1' ... '/path/to/fileN' [ --find-parameters <find_parameters> ]
         where:
         - The group 1: '/path/to/file1' and the group 2: '/path/to/file2' ... '/path/to/fileN' - are considered the two groups of files to be compared
         
         - <flags> can be:
             --help
                 - displays this help information
             --different or -d
                 - find URLs that differ (default flag)
             --common or -c
                 - find URLs that are common
             --domains
                 - compare and print only the domains (plus subdomains) of the URLs for: the group 1 and the group 2 - for the '-c' or the '-d' flag
             --domains-full
                 - compare only the domains (plus subdomains) of the URLs but print the full URLs for: the group 1 and the group 2 - for the '-c' or the '-d' flag
             --preserve-order or -p
                 - preserve the order and the occurrences in which the links appear in group 1 and in group 2
                 - Warning: when using this flag - process substitution is used by this script - which does not work with the "dash" shell (throws an error). For this flag, you can use other "dash" syntax compatible shells, like: bash, zsh, ksh
             --skip-non-text
                 - skip non-text files from search (does not look into: .docx, .xlsx, .pptx, .ppsx, .odt, .ods, .odp, .pdf, .zip, .7z, .rar, .bz2, .xz, .gz, .tar.*, files)
             --go-to-eol
                 - once a match is found: go to the end of line (include everything starting from the match to the end of the line)
             --include-prefix
                 - for each match: include the preceding characters before the URL (i.e. include any consecutive characters before the URL, that are not any of the characters stored in the "PREFIX_DELIMITERS" variable)
             --use-only-terminal or -t
                 - uses only the terminal for printing the output
             --ignore-warnings or -i
                 - does not print warning messages
             --find-parameters <find_parameters>
                 - all the parameters given after this flag, are considered 'find' parameters
                 - <find_parameters> can be: any parameters that can be passed to the 'find' utility (which is used internally by this script) - such as: name/path filters
             -h
                 - also look in hidden files
     
     Output:
         - '<' - denote URLs from the group 1: '/path/to/file1'
         - '>' - denote URLs from the group 2: '/path/to/file2' ... '/path/to/fileN'

EOF
}


set +f #Enable globbing (POSIX compliant)
setopt no_nomatch 2>/dev/null #Enable globbing (zsh)

print_to_screen="/dev/tty" #Print to screen only

initial_dir="$PWD" #Store initial directory value

initial_IFS="$IFS" #Store initial IFS value

Q="'"

GetOSType OS_TYPE


#Trap "INTERRUPT" (CTRL + C) and "TERMINAL STOP" (CTRL + Z) signals:
trap 'trap1' INT
trap 'trap1' TSTP

find_parameters=""

if [ "$OS_TYPE" = "Linux" ]; then
    find_negation='!'
fi

#Process parameters:

different_flag="1" #default flag
common_flag="0"
domains_flag="0"
domains_full_flag="0"
preserve_order_flag="0"
goto_EOL_flag="0"
include_prefix_flag="0"
use_only_terminal_flag="0"
command1_flag="0"
command2_flag="0"
command3_flag="0"
command4_flag="0"
commandN_flag="0"
skip_non_text_files_flag="0"
find_parameters_flag="0"
hidden_files_flag="0"
ignore_warnings_flag="0"
help_flag="0"

flag_parameters_count="0"
file_parameters_count="0"
find_parameters_count="0"

for parameter; do
    if [ "$find_parameters_flag" = "0" ]; then
        case "$parameter" in
            "--different" | "-d" | "--common" | "-c" | "--domains" | "--go-to-eol" | "--include-prefix" | \
            "--domains-full" | "--preserve-order" | "-p" | "--ignore-warnings" | "-i" | "--use-only-terminal" | "-t" | \
             "--command1" | "--command2" | "--command3" | "--command4" | "--skip-non-text" | "--find-parameters" | "-h" | \
            "--help" )
                flag_parameters_count=$(($flag_parameters_count + 1))
                eval flag_parameters_$flag_parameters_count=\"\$parameter\"
                case "$parameter" in
                    "--different" | "-d" )
                        different_flag="1"
                        common_flag="0"
                    ;;
                    "--common" | "-c" )
                        common_flag="1"
                        different_flag="0"
                    ;;
                    "--domains" )
                        domains_flag="1"
                    ;;
                    "--domains-full" )
                        domains_full_flag="1"
                    ;;
                    "--preserve-order" | "-p" )
                        preserve_order_flag="1"
                    ;;
                    "--go-to-eol" )
                        goto_EOL_flag="1"
                    ;;
                    "--include-prefix" )
                        include_prefix_flag="1"
                    ;;
                    "--use-only-terminal" | "-t" )
                        use_only_terminal_flag="1"
                    ;;
                    "--command1" )
                        command1_flag="1"
                        commandN_flag="1"
                    ;;
                    "--command2" )
                        command2_flag="1"
                        commandN_flag="2"
                    ;;
                    "--command3" )
                        command3_flag="1"
                        commandN_flag="3"
                    ;;
                    "--command4" )
                        command4_flag="1"
                        commandN_flag="4"
                    ;;
                    "--skip-non-text" )
                        skip_non_text_files_flag="1"
                    ;;
                    "--find-parameters" )
                        find_parameters_flag="1"
                    ;;
                    "--ignore-warnings" | "-i" )
                        ignore_warnings_flag="1"
                    ;;
                    "-h" )
                        hidden_files_flag="1"
                    ;;
                    "--help" )
                        help_flag="1"
                    ;;
                esac
            ;;
            * )
                file_parameters_count=$(($file_parameters_count + 1))
                eval file_parameters_$file_parameters_count=\"\$parameter\"
            ;;
        esac
    elif [ "$find_parameters_flag" = "1" ]; then
        find_parameters_count=$(($find_parameters_count + 1))
        eval find_parameters_$find_parameters_count=\"\$parameter\"
    fi
done
flag_parameters_0="$flag_parameters_count"
file_parameters_0="$file_parameters_count"
find_parameters_0="$find_parameters_count"

if [ "$help_flag" = "1" ] || ( [ "$file_parameters_0" = "0" ] && [ "$find_parameters_0" = "0" ] ); then
    DisplayHelp
    exit 0
fi


#Store New Line and Tab for use with sed:
if [ "$OS_TYPE" = "Linux" ]; then
    NL=$(printf '%s' "\n")
    TAB=$(printf '%s' "\t")
fi

angle_bracket_1='<'
angle_bracket_2='>'

#By URL, it is meant a website link in the form: ...://... OR x.y...
if [ "$goto_EOL_flag" = "1" ]; then
    NON_EOL_CHARS='^$'
else
    NON_EOL_CHARS='^ ^'"${TAB}"'^>^<'
fi
if [ "$include_prefix_flag" = "1" ]; then
    
    PREFIX_DELIMITERS='\t' #PREFIX DELIMITERS FOR RECOGNIZING THE PREFIX (CURRENTLY: TAB)
    
    PREFIX_CHARS='['"$PREFIX_DELIMITERS"']*([^'"$PREFIX_DELIMITERS"']*)'
    SED_REPLACE_SEQUENCE='\1\2\3\7'
else
    PREFIX_CHARS='([^a-zA-Z]*)'
    SED_REPLACE_SEQUENCE='\2\3\7'
fi
insert_NL_before_and_after_URLs_command='sed -E '"'"'s/'"${PREFIX_CHARS}"'([a-zA-Z]+\:\/\/){0,1}((([a-zA-Z]+[0-9\-]*)+(\.[a-zA-Z]+[0-9\-]*)+)+)((['"${NON_EOL_CHARS}"'])*)/'"${NL}""${SED_REPLACE_SEQUENCE}""${NL}"'/g'"'"
strip_NON_FULL_URL_text_command='sed -E '"'"'s/'"${PREFIX_CHARS}"'([a-zA-Z]+\:\/\/){0,1}((([a-zA-Z]+[0-9\-]*)+(\.[a-zA-Z]+[0-9\-]*)+)+)((['"${NON_EOL_CHARS}"'])*).*/'"${SED_REPLACE_SEQUENCE}"'/g'"'"
strip_NON_domain_text_command='sed -E '"'"'s/((([a-zA-Z]+[0-9\-]*)+(\.[a-zA-Z]+[0-9\-]*)+)+)((['"${NON_EOL_CHARS}"'])*).*/\1/g'"'"
delete_lines_not_containing_an_URL='sed -E '"'"'/.*((([a-zA-Z]+[0-9\-]*)+(\.[a-zA-Z]+[0-9\-]*)+)+).*/!d'"'"
prepare_for_output_command='sed -E '"'"'s/ *([0-9]*)[\ *](<|>) *([0-9]*)[\ *](.*)/\2 \4 \1/g'"'"
remove_angle_brackets_command='sed -E '"'"'s/(<|>) (.*)/\2/g'"'"
extract_urls_command="$insert_NL_before_and_after_URLs_command|$strip_NON_FULL_URL_text_command|$delete_lines_not_containing_an_URL"
get_domains_command="$insert_NL_before_and_after_URLs_command|$strip_NON_domain_text_command|$delete_lines_not_containing_an_URL"

#Check if any of the necessary utilities is missing:

CheckUtilities find file kill seq ps sort uniq sed grep cat

if [ "$skip_non_text_files_flag" = "0" ]; then
    CheckUtilities --warning unzip tar bzip2 xz gzip 7z unrar pdftohtml
fi

warning_msg=""
CheckUtilities --warning-message warning_msg meld
if [ "$warning" = "true" ]; then meld_is_not_installed="true"; else meld_is_not_installed="false"; fi

#Process parameters/flags and check for errors:

if [ ! "$find_parameters_0" = "0" ]; then
    find_parameters="$(for i in $(GenerateSequence 1 $find_parameters_0;); do eval printf \'\%s \' "\'\$find_parameters_$i\'"; done;)"
else
    find_parameters='-name "*"'
fi

if [ "$hidden_files_flag" = "1" ]; then
    hidden_files_string=""
elif [ "$hidden_files_flag" = "0" ]; then
    hidden_files_string="$find_negation"' -path ''"*/.*"'
fi

find_parameters="$hidden_files_string"" -a ""$find_parameters"

GetCurrentShell current_shell_name current_shell_full_path
cd "${0%/*}" 2>/dev/null; current_script_path="$(pwd -P)/${0##*/}"
current_script_path_escaped="$(printf '%s\n' "$current_script_path"|sed "s/'/$Q\"\$Q\"$Q/g")"

if [ "$different_flag" = "0" ] && [ "$common_flag" = "0" ]; then
    PrintErrorMessageAndSetError "ERROR: Expected either -c or -d flag!"
elif [ "$different_flag" = "1" ] && [ "$common_flag" = "1" ]; then
    PrintErrorMessageAndSetError "ERROR: The '-c' flag cannot be used together with the '-d' flag!"
fi

eval find \'/dev/null\' "$find_parameters">/dev/null 2>&1||{
    PrintErrorMessageAndSetError "ERROR: Invalid parameters for the 'find' command!"
}

if [ "$error" = "true" ]; then
    printf "\n"
    PrintErrorExtra
    CleanUp; exit 1;
fi>&2

if [ "$file_parameters_0" = "0" ]; then
    DisplayHelp
else
    #Check if files given as parameters are accessible/readable:
    error="false"
    for i in $(GenerateSequence 1 $file_parameters_0); do
        eval current_file_path=\"\$file_parameters_$i\"
        if [ ! -e "$current_file_path" ]; then
            PrintErrorMessageAndSetError "ERROR: File '$current_file_path' does not exist or is not accessible!"
        elif [ ! -r "$current_file_path" ]; then
            PrintErrorMessageAndSetError "ERROR: File <$i> = '$current_file_path' is not readable!"
        fi
    done
    
    if [ "$error" = "true" ]; then
        printf "\n"
        PrintErrorExtra
        CleanUp; exit 1;
    fi>&2
    
    GetMimeTypes
    
    #Expand file parameters to their full paths and re-store them:
    for i in $(GenerateSequence 1 $file_parameters_0); do
        eval current_file_path=\"\$file_parameters_$i\"
        eval current_mime_type=\"\$mime_types_$i\"
        ExtractFirstAndLastPathComponent current_file_path fpc_current_file_path lpc_current_file_path
        cd "$initial_dir"
        cd "$fpc_current_file_path"
        current_file_path="$PWD/$lpc_current_file_path"
        current_file_path_escaped="$(printf '%s\n' "$current_file_path"|sed "s/'/$Q\"\$Q\"$Q/g")"
        eval file_parameters_$i=\"\$current_file_path\"
        eval escaped_file_parameters_$i=\"\$current_file_path_escaped\"
        if [ "$current_mime_type" = "device" ]; then
            current_stored_file="$(cat "$current_file_path")"
            eval stored_file_parameters_$i=\"\$current_stored_file\"
        fi
        cd "$initial_dir"
    done
    stored_file_parameters_0="$file_parameters_0"
    stored_file_parameters_type_0="$file_parameters_0"
    
    #Proceed to finding and comparing URLs:
    
    IFS='
'
    
    StoreURLsWithLineNumbers
    
    IFS=' 
'
    if [ "$domains_full_flag" = "0" ]; then
        
        if [ "$preserve_order_flag" = "0" ]; then
            { ProcessGroup 1; ProcessGroup 2; }|SortAndFilter1
        else
            if [ "$current_shell_name" = "dash" ]; then
                PrintErrorMessageAndSetError "ERROR: The '--preserve-order' flag makes use of process substitution, which is not available in 'dash' (you can use other \"dash\" syntax compatible shells, like: 'bash', 'zsh', 'ksh' instead)!"
                printf "\n"
                exit 1
            fi
            
            if [ "$meld_is_not_installed" = "false" ] && [ "$use_only_terminal_flag" = "0" ]; then
                eval meld \<\(ProcessGroup 1\|SortAndFilter1\) \<\(ProcessGroup 2\|SortAndFilter1\)
            else
                if [ "$use_only_terminal_flag" = "0" ]; then
                    PrintWarningMessage "WARNING: The 'Meld' utility is not installed - defaulting to printing to 'terminal' instead!"
                fi
                IFS='
'
                URL_count=0
                current_line=""
                for line in $(eval diff \
                        \<\(\
                            count1=0\;\
                            for i in \$\(GenerateSequence 1 $lines11_0\)\; do\
                                count1=\$\(\(\$count1 + 1\)\)\;\
                                eval URL=\\\"\\\$lines12_\$i\\\"\;\
                                printf \"\%s\\n\" \"File group: 1 URL: \$count1\"\;\
                                printf \"\%s\\n\" \"\$URL\"\;\
                            done\;\
                            printf \"\%s\\n\" \"\#\#\# Separator 1\"\;\
                        \) \
                        \<\(\
                            count2=0\;\
                            for i in \$\(GenerateSequence 1 $lines21_0\)\; do\
                                count2=\$\(\(\$count2 + 1\)\)\;\
                                eval URL=\\\"\\\$lines22_\$i\\\"\;\
                                printf \"\%s\\n\" \"File group: 2 URL: \$count2\"\;\
                                printf \"\%s\\n\" \"\$URL\"\;\
                            done\;\
                            printf \"\%s\\n\" \"\#\#\# Separator 2\"\;\
                        \) \
                    ); do
                    URL_count=$(($URL_count + 1))
                    previous_line="$current_line"
                    current_line="$line"
                    #if ( current line starts with "<" and previous line starts with "<" ) OR ( current line starts with ">" and previous line starts with ">" ):
                    if ( [ ! "${current_line#"<"}" = "${current_line}" ] && [ ! "${previous_line#"<"}" = "${previous_line}" ] ) || ( [ ! "${current_line#">"}" = "${current_line}" ] && [ ! "${previous_line#">"}" = "${previous_line}" ] ); then
                        printf '%s\n' "$previous_line"
                    fi
                done
            fi
        fi
    
    elif [ "$domains_full_flag" = "1" ]; then
        
        if [ "$current_shell_name" = "dash" ]; then
            PrintErrorMessageAndSetError "ERROR: The '--preserve-order' flag makes use of process substitution, which is not available in 'dash' (you can use other \"dash\" syntax compatible shells, like: 'bash', 'zsh', 'ksh' instead)!"
            printf "\n"
            exit 1
        fi
        
        #"sc" = "script_command":
        
        # Command to find common domains:
        eval sc1="\"$current_shell_full_path '\$current_script_path_escaped' -c --domains $(for i in $(GenerateSequence 1 $file_parameters_0); do printf '%s ' \'\$escaped_file_parameters_$i\'; done;) -i\""
        
        # URLs that are only in first parameter file (file group 1):
        eval sc2="\"$current_shell_full_path '\$current_script_path_escaped' -d $(printf '%s ' \'\$escaped_file_parameters_1\';) \"/dev/null\" -i\""
        
        # Command to find common domains:
        eval sc3="\"$current_shell_full_path '\$current_script_path_escaped' -c --domains $(for i in $(GenerateSequence 1 $file_parameters_0); do printf '%s ' \'\$escaped_file_parameters_$i\'; done;) -i\""
        
        # URLs that are only in 2..N parameter files (file group 2):
        eval sc4="\"$current_shell_full_path '\$current_script_path_escaped' -d \"/dev/null\" $(for i in $(GenerateSequence 2 $file_parameters_0); do printf '%s ' \'\$escaped_file_parameters_$i\'; done;) -i\""
        
        #Store the output for <command substitutions> one at a time (synchronously):
        sc1_output="$(eval $sc1 --command1)"
        sc2_output="$(eval $sc2 --command2)"
        sc3_output="$(eval $sc3 --command3)"
        sc4_output="$(eval $sc4 --command4)"
        
        if [ "$preserve_order_flag" = "0" ] || [ "$meld_is_not_installed" = "true" ] || [ "$use_only_terminal_flag" = "1" ]; then
            if [ "$meld_is_not_installed" = "true" ] && [ "$use_only_terminal_flag" = "0" ]; then
                PrintWarningMessage "WARNING: The 'Meld' utility is not installed - defaulting to printing to 'terminal' instead!"
            fi
            if [ "$different_flag" = "1" ]; then
                eval grep \-F \-vf \<\( printf \'\%s\' \"\$sc1_output\"\; \) \<\( printf \'\%s\' \"\$sc2_output\"\; \)
                eval grep \-F \-vf \<\( printf \'\%s\' \"\$sc3_output\"\; \) \<\( printf \'\%s\' \"\$sc4_output\"\; \)
            elif [ "$common_flag" = "1" ]; then
                eval grep \-F \-f \<\( printf \'\%s\' \"\$sc1_output\"\; \) \<\( printf \'\%s\' \"\$sc2_output\"\; \)
                eval grep \-F \-f \<\( printf \'\%s\' \"\$sc3_output\"\; \) \<\( printf \'\%s\' \"\$sc4_output\"\; \)
            fi
        elif [ "$meld_is_not_installed" = "false" ] && [ "$use_only_terminal_flag" = "0" ]; then
            if [ "$different_flag" = "1" ]; then
                eval meld \<\(grep \-vF \-f \<\( printf \'\%s\' \"\$sc1_output\"\; \) \<\( printf \'\%s\' \"\$sc2_output\"\; \)\;\) \<\(grep \-vF \-f \<\( printf \'\%s\' \"\$sc3_output\"\; \) \<\( printf \'\%s\' \"\$sc4_output\"\; \)\;\)
            elif [ "$common_flag" = "1" ]; then
                eval meld \<\(grep \-F \-f \<\( printf \'\%s\' \"\$sc1_output\"\; \) \<\( printf \'\%s\' \"\$sc2_output\"\; \)\;\) \<\(grep \-F \-f \<\( printf \'\%s\' \"\$sc3_output\"\; \) \<\( printf \'\%s\' \"\$sc4_output\"\; \)\;\)
            fi
        fi
    fi
fi

CleanUp

For the asked question - this should do it:

dash '/path/to/the/above/script.sh' '/path/to/file1/containing/URLs.txt'

Note: When using the Konsole terminal emulator, one additional initial step is required before a customized (personal) terminal emulator window title can be displayed:

Konsole -> Settings -> Configure Konsole ... ->
-> Enable option "Show window title on the titlebar"

Upvotes: 0

K J

Reputation: 11733

I see that nobody addressed the comment that the Windows console commands did not "work":

I tried writing a batch file, and it told me FINDSTR is not an internal or external command

If we take the OP's input at face value (otherwise this would need adjusting), we could use Findstr, which supports regex, but Find is simpler:

type html.txt|find /i "http">temp.txt&for /f "tokens=2 delims==>" %f in (temp.txt) do @echo %f

(In a Windows batch file, each %f must be doubled to %%f.)

This produces the result below, which can be redirected to a file. A second line could strip the surrounding quotes, but it is unclear whether a downstream step needs them, since double-quoted values are often preferred on Windows.

"http://www.drawspace.com/lessons/b03/simple-symmetry"
"http://www.drawspace.com/lessons/b04/faces-and-a-vase"
"http://www.drawspace.com/lessons/b05/blind-contour-drawing"
"http://www.drawspace.com/lessons/b06/seeing-values"

Upvotes: 0

RARE Kpop Manifesto

Reputation: 2819

Assuming a well-formed HTML document with only one href link per line, here is one awk approach that does not need backreferences to regex capturing groups:

{m,g}awk 'NF*=2<NF' OFS= FS='^.*<[Aa] [^>]*[Hh][Rr][Ee][Ff]=\"|\".*$'

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
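
As far as I can tell, the one-liner works by making the field separator swallow everything before and after the URL, so that the URL is the only non-empty field left. A more verbose sketch of the same idea, using awk's match() function instead of the FS trick, might look like this. The function name extract_first_href and the sample file are hypothetical, and the same one-href-per-line assumption applies:

```shell
#!/bin/sh
# Hypothetical helper: print the first double-quoted href value on each line.
extract_first_href() {
    awk '
        # match() sets RSTART and RLENGTH to the span of href="..." on this line
        match($0, /[Hh][Rr][Ee][Ff]="[^"]*"/) {
            # skip the 6 characters of href=" and drop the closing quote
            print substr($0, RSTART + 6, RLENGTH - 7)
        }
    ' "$@"
}

# Sample input modeled on the HTML in the question
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr valign="top">
        <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
EOF
extract_first_href "$sample"
rm -f "$sample"
```

Lines without an href attribute produce no output, so the result can be redirected straight into a text file.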

Upvotes: 0

kvantour

Reputation: 26471

As per triplee's comment, using regex to parse HTML or XML files is generally discouraged. Tools such as sed and awk are extremely powerful for handling text files, but when it comes to parsing complex structured data (such as XML, HTML, or JSON) they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. Handling such delicate files takes more finesse and a more targeted set of tools.

In case of parsing XML or HTML, one can easily use xmlstarlet.

For an XHTML file, you can use:

xmlstarlet sel --html  -N "x=http://www.w3.org/1999/xhtml" \
               -t -m '//x:a/@href' -v . -n

where -N declares the XHTML namespace, if any. The namespace can be recognized from the root element:

<html xmlns="http://www.w3.org/1999/xhtml">

However, as HTML pages are often not well-formed XML, it can help to clean them up first using tidy. For the example above, this gives:

$ tidy -q -numeric -asxhtml --show-warnings no <file.html> \
  | xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
                   -t -m '//x:a/@href' -v . -n
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values

Upvotes: 2

Sathish

Reputation: 1475

Use grep to extract the anchor tags that contain links, and then use sed to pull out the URLs:

grep -o '<a href=".*">' *.html | sed 's/\(<a href="\|\">\)//g' > link.txt;
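
One caveat: the `.*` in the pattern above is greedy, so a line with two anchors is matched as a single chunk, and with several input files grep prefixes each match with the file name. A minimal sketch of a variation that avoids both, assuming GNU or BSD grep (the function name list_anchor_hrefs and the file names are made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: list every <a href="..."> target in the given files.
list_anchor_hrefs() {
    # -o prints only the matched text, -h suppresses file-name prefixes;
    # [^"]* stops at the first closing quote, so two anchors on one line
    # are reported separately instead of being merged by a greedy .*
    grep -oh '<a href="[^"]*"' "$@" | sed 's/^<a href="//; s/"$//'
}

tmpdir=$(mktemp -d)
printf '%s\n' '<a href="http://example.com/1">one</a> <a href="http://example.com/2">two</a>' > "$tmpdir/page.html"
list_anchor_hrefs "$tmpdir/page.html" > "$tmpdir/links.txt"
cat "$tmpdir/links.txt"
rm -rf "$tmpdir"
```

This still only finds anchors whose first attribute is href, exactly like the original pattern.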

Upvotes: 2

Michael

Reputation: 1151

My guess is that your PC or Mac does not have the lynx command installed by default (it is available for free on the web), but lynx lets you do things like this:

$ lynx -dump -image_links -listonly /usr/share/xdiagnose/workloads/youtube-reload.html

Output: References

file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html
http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1

It is then a simple matter to grep for the http: lines. There may even be lynx options to print just the http: lines (lynx has many, many options).
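
The grep step could be sketched roughly as follows. The helper name keep_http_links is hypothetical, and the printf stands in for the lynx output shown above so that the sketch runs without lynx installed:

```shell
#!/bin/sh
# Hypothetical helper: keep only the http(s) URLs from a link listing on stdin.
keep_http_links() {
    # -E enables extended regex, -o prints just the URL, dropping any
    # list numbering or leading whitespace that precedes it
    grep -Eo 'https?://[^[:space:]]+'
}

# Stand-in for the lynx output shown above
printf '%s\n' \
    'References' \
    '' \
    'file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html' \
    'http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1' |
    keep_http_links

# With lynx installed, the real pipeline would presumably be:
#   lynx -dump -listonly page.html | keep_http_links > links.txt
```

The file:// entry is dropped because the pattern only accepts http:// and https:// prefixes.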

Upvotes: 7

Ed Morton

Reputation: 203209

$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
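
To store the result in a text file as the question asks, the command can be wrapped as in the sketch below. The function name save_links and the file names are placeholders, not part of the answer. Note that because the leading `.*` is greedy, a line with several href attributes yields only the last one:

```shell
#!/bin/sh
# Hypothetical wrapper around the sed command above.
save_links() {
    # $1 = HTML input file, $2 = output text file (both names are placeholders)
    # -n suppresses default output; the p flag prints only lines where the
    # substitution succeeded, i.e. lines that contained href="..."
    sed -n 's/.*href="\([^"]*\).*/\1/p' "$1" > "$2"
}

workdir=$(mktemp -d)
cat > "$workdir/website.htm" <<'EOF'
<td class="beginner">B03</td>
<td><a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a></td>
EOF
save_links "$workdir/website.htm" "$workdir/links.txt"
cat "$workdir/links.txt"
rm -rf "$workdir"
```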

Upvotes: 41

fedorqui

Reputation: 289505

You can use grep for this:

grep -Po '(?<=href=")[^"]*' file

It prints everything after href=" until a new double quote appears.

With your given input it returns:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values

Note that it is not necessary to write cat drawspace.txt | grep '<a href=".*">'; you can avoid the useless use of cat by writing grep '<a href=".*">' drawspace.txt.

Another example

$ cat a
hello <a href="httafasdf">asdas</a>
hello <a href="hello">asdas</a>
other things

$ grep -Po '(?<=href=")[^"]*' a
httafasdf
hello
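
The -P option relies on GNU grep's PCRE support, which the BSD grep shipped with macOS does not offer. Since the question mentions a Mac, here is a minimal sketch of an equivalent that uses only grep -o and cut (the function name hrefs_without_pcre is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: same idea as the grep -Po command, without -P.
hrefs_without_pcre() {
    # grep -o emits each href="..." match on its own line;
    # cut then splits on the double quote and keeps the second field
    grep -o 'href="[^"]*"' "$@" | cut -d'"' -f2
}

sample=$(mktemp)
printf '%s\n' \
    'hello <a href="httafasdf">asdas</a>' \
    'hello <a href="hello">asdas</a>' \
    'other things' > "$sample"
hrefs_without_pcre "$sample"
rm -f "$sample"
```

This also shows a working form of the cut step attempted in the question: the delimiter must be a plain ASCII double quote, and a field must be selected with -f.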

Upvotes: 34
