Problem description:

In the awk script below I am using $5, $7 and $8 of file1 to look up $3, $5 and $6 of file2. The header row should be skipped, and the script should then write a new file listing which lines match and, for those that do not match, which file the entry is missing from. My attempt, which is meant to use the 3 fields as the lookup key and does not skip the header, produces the current output shown below. I apologize for the long post and file examples; I am just trying to include everything needed to get this working. Thank you :).

file1

Index  Chromosomal Position  Gene  Inheritance  Start     End       Ref  Alt  Func.refGene
98     48719928              FBN1  AD           48719928  48719929  AT   -    exonic
101    48807637              FBN1  AD           48807637  48807637  C    T    exonic

file2

R_Index  Chr    Start     End       Ref  Alt  Func.IDP.refGene
36       chr15  48719928  48719929  AT   -    exonic
37       chr15  48719928  48719928  A    G    exonic
38       chr15  48807637  48807637  C    T    exonic

awk

awk -F'\t' '
NR == FNR {
    A[$25]; A[$26]; A[$27]
    next
}
{
    B[$3]; B[$5]; B[$6]
}
END {
    print "Match"
    OFS=","
    for ( k in A )
    {
        if ( k && k in B )
            printf "%s ", k
    }
    print "Missing from file1"
    OFS=","
    for ( k in B )
    {
        if ( ! ( k in A ) )
            printf "%s ", k
    }
    print "Missing from file2"
    OFS=","
    for ( k in A )
    {
        if ( ! ( k in B ) )
            printf "%s ", k
    }
}
' file1 file2 > list

current output

Match
Missing from file1
A C Ref 48807637 Alt Start T G - AT 48719928 Missing from file2

desired output

Match 48719928 AT -, 48807637 C T
Missing from file1 48719928 A G
Missing from file2

Answer:

Program 1

This works, except the output format is different from what you request:

awk 'FNR==1 { next }
     FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8 }
     FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
     END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
           print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
           print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
     }' file1 file2

Output 1

Match:
48807637 C T
48719928 AT -
Missing in file1:
48719928 A G
Missing in file2:
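
For readers wondering why Program 1 subscripts each array with three fields at once: in awk, a subscript such as file1[$5,$7,$8] joins the three values with SUBSEP into a single composite key, whereas writing A[$5]; A[$7]; A[$8] creates three unrelated one-field keys. The sketch below illustrates the difference on one made-up input line fed through a pipe (the array names single and combo are only for illustration, and the fields are $1 to $3 here rather than the real column numbers):

```shell
out=$(printf '48719928 AT -\n' | awk '
{
    single[$1]; single[$2]; single[$3]   # three independent one-field keys
    combo[$1,$2,$3]                      # one key: $1 SUBSEP $2 SUBSEP $3
}
END {
    n = 0; for (k in single) n++
    m = 0; for (k in combo)  m++
    # A composite key is only found when all three parts agree.
    hit  = (("48719928", "AT", "-") in combo)
    miss = (("48719928", "A",  "G") in combo)
    print "single keys:", n
    print "combo keys:", m
    print "exact triple found:", hit
    print "other triple found:", miss
}')
printf '%s\n' "$out"
```

With single-field keys, a Start value from one row can pair up with a Ref or Alt from a different row, which is why the composite key is the safer choice for this comparison.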

Program 2

If you must have each set of values in a category comma-separated on a single line, then:

awk 'FNR==1 { next }
     FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8 }
     FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
     END {
            printf "Match"
            pad = " "
            for (k in file1)
            {
                if (k in file2)
                {
                    printf "%s%s", pad, file1[k]
                    pad = ", "
                }
            }
            print ""

            printf "Missing in file1"
            pad = " "
            for (k in file2)
            {
                if (!(k in file1))
                {
                    printf "%s%s", pad, file2[k]
                    pad = ", "
                }
            }
            print ""

            printf "Missing in file2"
            pad = " "
            for (k in file1)
            {
                if (!(k in file2))
                {
                    printf "%s%s", pad, file1[k]
                    pad = ", "
                }
            }
            print ""
     }' file1 file2

The code is a little bigger, but the format used exacerbates the difference. The change is all in the END block; the other code is unchanged. The sequences of actions in the END block no longer fit comfortably on a single line, so they're spread out for readability. You can apply a liberal smattering of semicolons and concatenate the lines to shrink the apparent size of the program if you desire.

It's tempting to try a function for the printing, but the conditions just make it too tricky to be worthwhile, I think — but I'm open to persuasion otherwise.
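
For what it's worth, here is one way such a function could look. This is only a sketch: report() and its want flag are hypothetical names, not part of the programs above, and the script recreates the question's sample data as tab-separated files in a temporary directory so it can run on its own. It relies on (k in other) evaluating to 1 or 0, so a single comparison against want covers both the "match" and the "missing" cases.

```shell
dir=$(mktemp -d)
{
  printf 'Index\tChromosomal Position\tGene\tInheritance\tStart\tEnd\tRef\tAlt\tFunc.refGene\n'
  printf '98\t48719928\tFBN1\tAD\t48719928\t48719929\tAT\t-\texonic\n'
  printf '101\t48807637\tFBN1\tAD\t48807637\t48807637\tC\tT\texonic\n'
} > "$dir/file1"
{
  printf 'R_Index\tChr\tStart\tEnd\tRef\tAlt\tFunc.IDP.refGene\n'
  printf '36\tchr15\t48719928\t48719929\tAT\t-\texonic\n'
  printf '37\tchr15\t48719928\t48719928\tA\tG\texonic\n'
  printf '38\tchr15\t48807637\t48807637\tC\tT\texonic\n'
} > "$dir/file2"

out=$(awk -F'\t' '
# report() is a hypothetical helper: print label, then every value in "have"
# whose key membership in "other" equals want (1 = present, 0 = absent).
function report(label, have, other, want,    k, pad) {
    printf "%s", label
    pad = " "
    for (k in have)
        if ((k in other) == want) {
            printf "%s%s", pad, have[k]
            pad = ", "
        }
    print ""
}
FNR == 1  { next }                                        # skip each header
FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8; next }    # first file
          { file2[$3,$5,$6] = $3 " " $5 " " $6 }          # second file
END {
    report("Match", file1, file2, 1)
    report("Missing in file1", file2, file1, 0)
    report("Missing in file2", file1, file2, 0)
}' "$dir/file1" "$dir/file2")
printf '%s\n' "$out"
rm -rf "$dir"
```

Whether this is clearer than the spelled-out END block is a matter of taste; the order of entries on the Match line still depends on the awk implementation's for (k in ...) traversal order.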

Output 2

Match 48807637 C T, 48719928 AT -
Missing in file1 48719928 A G
Missing in file2

This output will be a lot harder to parse than the one shown first, so doing anything automatically with it will be tricky. While there are only 3 entries to worry about, the line length isn't an issue. If you get to 3 million entries, the lines become very long and unmanageable.
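
If the result is going to be processed by another tool, a line-per-entry layout may be easier to work with. The sketch below is one possible variant, not something the question asked for: it prints a status tag followed by the three key fields, tab-separated, and sorts the lines so the order is stable. The tag names (match, missing_in_file1, missing_in_file2) are my own invention, and the sample files are recreated in a temporary directory so the example is self-contained.

```shell
dir=$(mktemp -d)
{
  printf 'Index\tChromosomal Position\tGene\tInheritance\tStart\tEnd\tRef\tAlt\tFunc.refGene\n'
  printf '98\t48719928\tFBN1\tAD\t48719928\t48719929\tAT\t-\texonic\n'
  printf '101\t48807637\tFBN1\tAD\t48807637\t48807637\tC\tT\texonic\n'
} > "$dir/file1"
{
  printf 'R_Index\tChr\tStart\tEnd\tRef\tAlt\tFunc.IDP.refGene\n'
  printf '36\tchr15\t48719928\t48719929\tAT\t-\texonic\n'
  printf '37\tchr15\t48719928\t48719928\tA\tG\texonic\n'
  printf '38\tchr15\t48807637\t48807637\tC\tT\texonic\n'
} > "$dir/file2"

out=$(awk -F'\t' -v OFS='\t' '
FNR == 1  { next }                      # skip each header row
FNR == NR { file1[$5,$7,$8]; next }     # file1 keys: Start, Ref, Alt
          { file2[$3,$5,$6] }           # file2 keys: Start, Ref, Alt
END {
    for (k in file1) {
        split(k, f, SUBSEP)             # recover the three parts of the key
        status = (k in file2) ? "match" : "missing_in_file2"
        print status, f[1], f[2], f[3]
    }
    for (k in file2)
        if (!(k in file1)) {
            split(k, f, SUBSEP)
            print "missing_in_file1", f[1], f[2], f[3]
        }
}' "$dir/file1" "$dir/file2" | LC_ALL=C sort)
printf '%s\n' "$out"
rm -rf "$dir"
```

Each line can then be filtered on its first column (for example with grep '^missing_in_file1'), and the line length stays constant no matter how many entries there are.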

Answer:

You misunderstand awk syntax and are confusing awk with shell. When you wrote:

A[$25] [$26] [$27]

you probably meant:

A[$25]; A[$26]; A[$27]

(and similarly for B[]) and when you wrote:

IFS=

since IFS is a shell variable, not an awk one, you maybe meant

FS=

But since you're doing that in the END section and never call split(), nothing there would use FS, so I don't know what you were hoping to achieve with that. Maybe you meant:

OFS=

But you aren't doing anything that would use OFS, and your desired output isn't comma-separated, so I don't know what you'd be hoping to achieve with that either.
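
To make the OFS point concrete: awk only inserts OFS between comma-separated arguments of print (and when it rebuilds $0 after a field is modified). String concatenation and printf ignore it. A small standalone illustration with made-up values:

```shell
out=$(awk 'BEGIN {
    OFS = ","
    print "a", "b"               # comma-separated arguments: joined with OFS
    print "a" "b"                # plain concatenation: OFS is not involved
    printf "%s %s\n", "a", "b"   # printf follows only its format string
}')
printf '%s\n' "$out"
```

This prints a,b then ab then a b, so setting OFS in a script that only uses printf and single-argument print changes nothing.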

If that's not enough info for you to solve your problem yourself, then reduce your example to something with 10 columns or fewer so we don't have to read a lot of irrelevant info to help you.
